Title: Exploring Blog Networks
1Exploring Blog Networks
- Patterns and a Model for Information Propagation
(As seen at SIAM- Data Mining 2007)
Mary McGlohon
In collaboration with Jure Leskovec, Christos Faloutsos, Natalie Glance, Matthew Hurst
Sandia National Labs, July 6, 2007
2Long-term Goals
- How does information on the Web propagate?
- With what pattern do ideas catch on, diffuse, and decrease in popularity?
- Can we build a model for this propagation?
3Why blogs?
- Blogs are a widely used medium of information for many topics and have become an important mode of communication.
- Blogs cite one another, creating a record of how information and ideas spread through a social network.
- This record is publicly available.
4Why do we care?
- Understanding how the blog network works is important for:
  - Social issues: political mapping, social trends and change, reactions to mass media.
  - Economic issues: marketing, predicting commercial success, discovering links between companies.
Example: blogs in the 2004 election (Adamic, Glance 2005).
5Immediate Goals
- Temporal questions: Does popularity have a half-life? Is there periodicity?
- Topological questions: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else?
- Generative model: Can we build a generative model that mimics properties of cascades?
6Outline
- Motivation
- Preliminaries
- Concepts and terminology
- Data
- Temporal Observations
- Topological Observations
- Cascade Generation Model
- Discussion and Conclusions
10What is a blog?
- A blog is a frequently updated webpage.
- A blog's author updates the blog using posts.
- Each post has a permanent hyperlink, and may contain links to other blog posts.
Example (from the figure):
- slashdot: "The iPhone is here, hooray!"
- boingboing: "At this link, Slashdot says the iPhone has arrived. But I'm not buying one, because..."
- Another blog: "Here Boingboing says they're not buying an iPhone. They're just jealous."
11From blogs to networks
[Figure: two panels. Blog network: slashdot, boingboing, MichelleMalkin, Dlisted. Post network: the individual posts of those blogs and the links between them.]
[Figure: cascades extracted from the post network, showing non-trivial vs. trivial cascades and stars vs. chains. Nodes a, b, c, d are cascade initiators; e is a connector.]
13From networks to cascades
- Non-trivial vs. trivial cascades.
- Cascade initiators are the first sources of information.
- We also have stars and chains.
[Figure: cascades extracted from the posts of slashdot, boingboing, MichelleMalkin, and Dlisted]
14Dataset (Nielsen Buzzmetrics)
- Gathered from August-September 2005
- Used set of 44,362 blogs, traced cascades
- 2.4 million posts, 5 million out-links, 245,404
blog-to-blog links
[Figure: number of posts over time (time in units of 1 day)]
15Outline
- Motivation
- Preliminaries
- Concepts and terminology
- Data
- Temporal Observations
- Does blog traffic behave periodically?
- How does popularity change over time?
- Topological Observations
- Cascade Generation Model
- Discussion and Conclusions
- Future Work
16Temporal Observations
- Does blog traffic behave periodically?
- Posts show a weekend effect: less traffic on Saturday and Sunday.
17Temporal Observations
- Does blog traffic behave periodically?
- Monday appears to compensate for this behavior, but this is not actually the case.
- We normalize the data: $\mathrm{count}_{\mathrm{norm}} = \mathrm{count} / p_d$,
- where $p_d$ is the percentage of links on that day.
[Figure: Monday post dropoff, number of in-links (log scale) vs. days after post; two panels: raw counts, and the same data normalized]
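The normalization can be sketched in a few lines. This is only an illustration: it assumes p_d is the share of all links created on each day of the week, and the function name and all numbers are made up for the example rather than taken from the dataset.

```python
def normalize_counts(counts_by_day, link_share_by_day):
    """Divide each raw in-link count by p_d, the share of links made on that weekday.

    counts_by_day: list of (weekday, count) pairs, one per day after the post.
    link_share_by_day: dict mapping weekday -> fraction of all links on that weekday.
    """
    return [(day, count / link_share_by_day[day]) for day, count in counts_by_day]


# Hypothetical shares: weekdays carry more links than weekend days.
share = {"Mon": 0.16, "Tue": 0.16, "Wed": 0.16, "Thu": 0.16,
         "Fri": 0.16, "Sat": 0.10, "Sun": 0.10}

# Hypothetical raw in-links to a Monday post on later days.
raw = [("Fri", 32), ("Sat", 20), ("Sun", 20), ("Mon", 32)]
normalized = normalize_counts(raw, share)
```

In this toy input the weekend dip and the Monday rebound both disappear after dividing by p_d, which is the effect the normalized plot is meant to show.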
18Temporal Observations
- How does post popularity change over time?
- Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].
Observation 1: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$.
[Figure: number of in-links vs. days after post]
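One rough way to read a power-law exponent off such a plot is a least-squares line in log-log space. The sketch below assumes that approach (the slides do not say how the exponent was fitted) and runs on synthetic data generated with exponent -1.5, not on the real in-link counts.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x); for y ~ x**a this estimates a."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return cov / var


# Synthetic dropoff curve (an assumption for illustration only).
days = list(range(1, 31))
inlinks = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, inlinks)
```

On noisy real counts a log-log regression is only a crude estimate; it is used here because it mirrors reading the slope off the plot.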
19Outline
- Motivation
- Preliminaries
- Temporal Observations
- Does blog traffic behave periodically?
- How does post popularity change over time?
- Topological Observations
- What are graph properties for blog networks?
- What shapes do cascades take on? Stars, chains, or something else?
- Cascade Generation Model
- Discussion and Conclusions
- Future Work
21Topological Observations
- What graph properties does the blog network exhibit? How connected?
- 44,356 nodes, 122,153 edges
- Half of blogs belong to the largest connected component.
22Topological Observations
- What power laws does the blog network exhibit?
[Figure: two log-log panels. Count vs. number of blog in-links; count vs. number of blog out-links]
Both in- and out-degree follow a power-law distribution: the in-link PL exponent is -1.7, and the out-degree PL exponent is near -3. This suggests strong rich-get-richer phenomena.
23Topological Observations
- How are blog in- and out-degree related?
In-links and out-links are only weakly correlated (correlation coefficient 0.16).
[Figure: number of blog out-links vs. number of blog in-links, both on log scale]
25Topological Observations
What graph properties does the post network
exhibit?
- Very sparsely connected: 98% of posts are isolated.
26Topological Observations
What power laws does the post network exhibit?
- Both in- and out-degree follow power laws.
- In-degree has PL exponent -2.15, out-degree has PL exponent -2.95.
[Figure: count vs. post in-degree; count vs. post out-degree]
29Topological Observations
How do we measure how information flows through
the network?
- We gather cascades using the following procedure:
  - Find all initiators (out-degree 0).
  - Follow in-links.
  - This produces a directed acyclic graph.
[Figure: example post network on nodes a-e, and the cascades extracted from it]
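The procedure above can be sketched as a breadth-first walk along in-links from each initiator. This is a minimal sketch, not the authors' code: the function name, the edge-list input format, and the toy post IDs are assumptions for illustration.

```python
from collections import defaultdict, deque


def extract_cascades(links):
    """links: iterable of (src, dst) pairs meaning post `src` links to post `dst`.

    Returns {initiator: set of (src, dst) edges in the cascade rooted at it}.
    """
    in_links = defaultdict(list)   # dst -> posts that link to dst
    out_degree = defaultdict(int)  # post -> number of posts it links to
    nodes = set()
    for src, dst in links:
        in_links[dst].append(src)
        out_degree[src] += 1
        nodes.update((src, dst))

    cascades = {}
    # Step 1: initiators are posts with out-degree 0.
    for root in sorted(n for n in nodes if out_degree[n] == 0):
        edges, seen, queue = set(), {root}, deque([root])
        # Step 2: follow in-links outward from the initiator.
        while queue:
            node = queue.popleft()
            for citing in in_links[node]:
                edges.add((citing, node))
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        cascades[root] = edges
    return cascades


# Toy post network (hypothetical): e cites a and b, and f cites e.
toy_links = [("e", "a"), ("e", "b"), ("f", "e")]
toy_cascades = extract_cascades(toy_links)
```

Here e acts as a connector: it appears in the cascades of both a and b. Isolated posts have no links at all, so they never enter the edge list; they correspond to the trivial cascades.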
30Topological Observations
How do we measure how information flows through
the network?
- Common cascade shapes are extracted using
algorithms in Leskovec2006.
31Topological Observations
How do we measure how information flows through
the network?
- The number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.
[Figure: number of edges vs. cascade size (number of nodes); effective diameter vs. cascade size]
32Topological Observations
How do we measure how information flows through
the network?
- We work with a bag of cascades: each cascade is a disconnected subgraph.
- We now explore some graph properties of cascades.
33Topological Observations
What graph properties do cascades exhibit?
- As before, in- and out-degree in the bag of cascades follow power laws.
[Figure: count vs. cascade node out-degree; count vs. cascade node in-degree]
35Topological Observations
What graph properties do cascades exhibit?
- Cascade size distributions also follow a power law.
Observation 2: The probability of observing a cascade on $n$ nodes follows a Zipf distribution, $p(n) \propto n^{-2}$.
[Figure: count vs. cascade size (number of nodes)]
37Topological Observations
What graph properties do cascades exhibit?
Star and chain sizes also follow power laws, with different exponents (star -3.1, chain -8.5).
[Figure: count vs. size of chain (number of nodes); count vs. size of star (number of nodes)]
38Outline
- Motivation
- Preliminaries
- Temporal Observations
- Topological Observations
- What are graph properties for blog networks?
- What shapes and patterns do cascades take on?
- Cascade Generation Model
- Epidemiological Background
- Proposed Model
- Experimental Validation
- Discussion and Conclusions
- Future Work
39Epidemiological models
- We consider modeling cascade generation as an epidemic, with ideas as viruses.
- We use the SIS model [Hethcote2000]:
  - At any time, an entity is in one of two states: susceptible or infected.
  - One parameter, β, determines how easily conversations spread.
47Cascade Generation Model
0. Begin with Blog Net.
[Figure: example Blog Net on blogs B1-B4, with weighted edges]
48Cascade Generation Model
0. Begin with Blog Net, but ignore edge weights.
Example: B1 links to B2, B2 links to B1, and B4 links to B2 and B1 as well as to itself. B3 is isolated, linking only to itself.
[Figure: the unweighted Blog Net on B1-B4]
49Cascade Generation Model
1. Randomly pick a blog to infect, and add a node for it to the cascade.
[Figure: B1 is picked and becomes the first node of the cascade]
51Cascade Generation Model
2. Infect each in-linked neighbor with probability β.
[Figure: of B1's in-linked neighbors, one is marked INFECT and the other DO NOT INFECT]
52Cascade Generation Model
3. Add infected neighbors to the cascade.
[Figure: B4 joins B1 in the cascade]
55Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: on the next pass, the remaining candidate is marked DO NOT INFECT]
56Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
Completed cascade!
[Figure: the completed cascade, containing B1 and B4]
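Steps 0-4 can be sketched as a short simulation. This is a hedged reading of the slides, not the authors' implementation: the function name, the dict-of-in-neighbors input, the (infected blog, infecting blog, round) edge format, and the `max_rounds` safety cap are all assumptions added for the example.

```python
import random


def generate_cascade(in_neighbors, beta, rng, max_rounds=1000):
    """One run of the cascade generation steps.

    in_neighbors: dict mapping each blog to the blogs that link to it
                  (step 0: edge weights are ignored).
    beta: infection probability per in-linked neighbor.
    Returns (starting blog, list of (infected blog, infecting blog, round)).
    """
    start = rng.choice(sorted(in_neighbors))          # step 1
    infected = {start}
    edges = []
    rounds = 0
    # max_rounds is a hypothetical guard; with small beta the loop ends on its own.
    while infected and rounds < max_rounds:
        rounds += 1
        newly_infected = set()
        for blog in sorted(infected):
            for neighbor in in_neighbors[blog]:
                if rng.random() < beta:               # step 2
                    edges.append((neighbor, blog, rounds))  # step 3
                    newly_infected.add(neighbor)
        infected = newly_infected                     # step 4: old nodes recover
    return start, edges


# The example network from the slides: B1 <-> B2, B4 -> B1, B4 -> B2,
# and self-links on B3 and B4.
blog_net = {"B1": ["B2", "B4"], "B2": ["B1", "B4"], "B3": ["B3"], "B4": ["B4"]}
start, edges = generate_cascade(blog_net, beta=0.025, rng=random.Random(0))
```

Because recovered blogs become susceptible again, the same blog can appear in a cascade more than once, which is the SIS behavior described earlier. Running this many times and tallying the resulting shapes would give a bag of simulated cascades to compare with the data.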
57CGM matches observations
- After trying several values, we settled on β = 0.025.
- 10 simulations, 2 million cascades each.
- Most frequent cascades: 7 of 10 matched exactly.
[Figure: most frequent cascade shapes, model vs. data]
58CGM matches observations
Cascade size in this model also follows a power law. The model distribution is shown with the real data points.
[Figure: count vs. cascade size (number of nodes)]
59CGM matches observations
- Stars and chains both follow power laws, close to
those observed in real data.
[Figure: count vs. star size; count vs. chain size]
60Results in brief
- Analyzed one of the largest available collections of blog information.
- Two networks: post network and blog network.
- Discovered several properties of the networks.
- Also analyzed properties of cascades.
- Presented generative model for cascades.
61Immediate questions answered
- Temporal questions: Does popularity have a half-life? Is there periodicity?
  - Popularity dropoff follows a power-law distribution, exactly as found in response times in other work. We do find that posts follow weekly periodicity.
[Figure: number of in-links vs. days after post]
62Immediate questions answered
- Topology: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else?
  - We find power-law distributions in almost every topological property. In cascade shapes, stars are more common than chains, and cascade sizes follow a power law. Cascades are tree-like.
[Figure: count vs. size of chain (number of nodes); count vs. size of star (number of nodes)]
63Immediate questions answered
- Can a simple model replicate this behavior?
- Yes. We developed a model based on the SIS model
in epidemiology. It is a simple model with only
one parameter, and it produces behavior
remarkably similar to that found in the dataset.
[Figure: count vs. star size; count vs. chain size]
64Future work and applications
- This work suggested that ideas may behave like viruses under an SIS model.
- This may be useful for mapping social/political trends.
- Further investigation into these properties may also allow early detection of changes in social or economic structure.
65Related work
- For an explanation of the SIS model:
  - [Hethcote2000] H. W. Hethcote. The mathematics of infectious diseases. SIAM Rev., 42(4):599-653, 2000.
- For algorithms for extracting cascade shapes:
  - [Leskovec2006] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. PAKDD 2006.
- For some modeling of power laws:
  - [Vazquez2006] A. Vazquez, J. G. Oliveira, Z. Dezso, K. I. Goh, I. Kondor, and A. L. Barabasi. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, 2006.
66Additional Info
- Mary McGlohon
- www.cs.cmu.edu/mmcgloho
- mcglohon_at_cmu.edu
67Acknowledgments
- Mary McGlohon was partially supported by an NSF Graduate Fellowship.
- Jure Leskovec was partially supported by a Microsoft Fellowship.
68Questions?
70Preliminaries- PCA
- We will work with very high-dimensional data (9,000 dimensions).
- Principal Component Analysis is a method of dimensionality reduction.
[Figure: hypothetical scatter plot with one point per blog; axes: depth upwards, conversation mass upwards]
73Preliminaries- PCA
We can represent any real $N \times M$ matrix $X$ as $X = U \Sigma V^T$,
where $U$ is of size $N \times r$, $r$ is the rank of matrix $X$, $\Sigma$ is a diagonal $r \times r$ matrix, and $V$ is $M \times r$.
74Preliminaries- PCA
- Reduce dimensionality by keeping only the leading components of $\Sigma$ and setting all other components to zero.
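Written out, the reduction amounts to truncating the decomposition. The block below is a standard statement of this, assuming $\Sigma$ denotes the diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$ and that $k$ (a symbol introduced here, not on the slides) is the number of components kept.

```latex
X = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
\qquad
X_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \quad k < r,
\qquad
\text{coordinates of row } j \text{ of } X \text{ in the reduced space: }
\left( \sigma_1 U_{j1}, \dots, \sigma_k U_{jk} \right).
```

With $k = 2$, each row of $X$ is summarized by two numbers, which is what makes a two-dimensional scatter plot possible.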
75Preliminaries- PCA
- Reference: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press.
77Preliminaries- Regularizing data
- Not everything in life is normally distributed.
[Figure: blog properties on a linear-linear scale, total in-links vs. total conversation mass downwards; annotation: 99.4% of points!]
79Preliminaries Regularizing data
- Not everything in life is normally distributed.
- Try to fit a line... Outliers dramatically affect the fit.
[Figure: blog properties on a linear-linear scale, total in-links vs. total conversation mass downwards, with a fitted line]
81Preliminaries Regularizing data
- Not everything in life is normally distributed.
- Therefore, we propose to take log(count + 1).
- Outliers' effects are minimized.
[Figure: blog properties on a log-log scale, total in-links vs. total conversation mass downwards]
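A minimal sketch of the log(count + 1) transform follows. The natural logarithm is an assumption (the slides do not state the base, and a different base only rescales the result), and the sample counts are invented.

```python
import math


def log_scale(counts):
    """Map each count c to log(c + 1); a count of zero stays at 0."""
    return [math.log1p(c) for c in counts]


# Hypothetical in-link totals with one extreme outlier.
in_links = [0, 1, 3, 10, 50000]
scaled = log_scale(in_links)
```

The added 1 keeps zero counts defined, and the outlier shrinks from 5,000 times the next-largest value to under five times it, so it no longer dominates a fitted line.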
82- Suppose we want to cluster blogs based on
content. What features do we use per blog?
83CascadeType
- Perform PCA on the sparse matrix (44,000 blogs by 9,000 cascade types).
- Use log(count + 1).
- Project onto 2 PCs.
84CascadeType Results
- Observation: Content of blogs and cascade behavior are often related.
- Distinct clusters for conservative and humorous blogs (hand-labeling).
86- Suppose we want to cluster blog posts. What
features do we use?
87Preliminaries- Blogs
- There are several terms we use to describe cascades:
- In-link, out-link
  - Green node has one out-link.
  - Yellow node has one in-link.
- Depth downwards/upwards
  - Pink node has an upward depth of 1 and a downward depth of 2.
- Conversation mass (CM) upwards/downwards
  - Pink node has upward CM 1 and downward CM 3.
88PostFeatures
Features per post: in-links, out-links, CM up, CM down, depth up, depth down.
Run PCA over the 2,400,000 posts.
90PostFeatures Results
- Observation: Posts within a blog tend to retain similar network characteristics.
- PC1: CM upward
- PC2: CM downward
- We show this scatter plot instead.
[Figure: scatter plot of posts from MichelleMalkin and Dlisted]
91Ranking blogs by PostFeatures
- Conversation mass up/down gives a better understanding of the blog posts than in-links and out-links.
- Therefore, we may choose to rank blogs based on these attributes.
92Blogs ranked by CM vs in-links
Top blogs by conversation mass
1. michellemalkin.com
2. boingboing.net
3. imao.us (75)
4. captainsquartersblog.com/mt
5. instapundit.com
6. radioequalizer.blogspot.com (53)
7. powerlineblog.com
8. waxy.org/links
9. washingtonmonthly.com
10. kottke.org/reminder
Top blogs by in-links
1. boingboing.net
2. michellemalkin.com
3. instapundit.com
4. waxy.org/links
5. kottke.com/reminder
6. patriotdaily.com (11)
7. captainsquartersblog.com/mt
8. powerlineblog.com
9. washingtonmonthly.com
10. petashon.com (30)
93Blogs ranked by CM vs in-links
Top blogs by conversation mass
1. michellemalkin.com
2. boingboing.net
3. imao.us (75)
4. captainsquartersblog.com/mt
Top blogs by in-links
1. boingboing.net
2. michellemalkin.com
3. instapundit.com
4. waxy.org/links
...
10. petashon.com (30)
[Slide annotations: "in-links 2, CM 6"; "in-links 5, CM 5"]
- Perhaps IMAO has longer cascades, just fewer in-links.
- While petashon has stars.
94BlogTimeFractal some time series
- Problem: time series data is nonuniform and difficult to analyze.
- Any patterns?
- Any measures?
[Figure: in-links over time]
95BlogTimeFractal Definitions
- Any patterns?
- Self-similarity!
- The 80-20 law describes self-similarity.
- For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.
- Repeat recursively.
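The recursive 80-20 split can be sketched as follows. This is a simplified illustration: it always gives the larger share to the first half, whereas the slides only say that 80% falls in "one" of the halves, and the function name and parameters are made up for the example.

```python
def b_model(total, levels, b=0.8):
    """Split `total` recursively: at each level one half gets fraction b, the other 1 - b.

    Returns 2**levels interval totals. Assumption: the first half always gets
    the larger share; a randomized variant would pick the side at random.
    """
    series = [float(total)]
    for _ in range(levels):
        series = [part for x in series for part in (b * x, (1 - b) * x)]
    return series


# 100 units of traffic split three times, giving 8 intervals.
traffic = b_model(100, 3)
```

After three levels the busiest interval holds 0.8^3 of the traffic and the quietest 0.2^3, which is the bursty, self-similar pattern the 80-20 law describes.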
98Self-similarity
- The bias factor for the 80-20 law is b = 0.8.
[Figure: an interval split into two halves carrying 20% and 80% of the traffic]
Q: How do we estimate b?
A: Entropy plots!
101BlogTimeFractal
- An entropy plot plots entropy vs. resolution.
- From time series data, begin with resolution r = T/2.
- Record entropy $H_r$.
- Recursively take finer resolutions.
102BlogTimeFractal Definitions
- Entropy measures the non-uniformity of the histogram at a given resolution.
- We define the entropy of our sequence at a given resolution $R$ as $H_R = -\sum_{t} p(t) \log_2 p(t)$,
- where $p(t)$ is the percentage of posts from a blog on interval $t$, $R$ is the resolution, and $2^R$ is the number of intervals.
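The entropy at a given resolution can be computed directly from a time series. This sketch assumes the standard Shannon entropy over the 2^R interval shares, treats p(t) as a fraction of the total, and requires the series length to be divisible by 2^R; the function name and the two toy series are illustrative only.

```python
import math


def entropy_at_resolution(series, R):
    """Shannon entropy (base 2) of the series aggregated into 2**R equal intervals."""
    n_intervals = 2 ** R
    width = len(series) // n_intervals  # assumes len(series) is divisible by 2**R
    total = sum(series)
    h = 0.0
    for i in range(n_intervals):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:  # empty intervals contribute nothing
            h -= p * math.log2(p)
    return h


# Two extreme toy series of length 16.
uniform = [1] * 16
point_mass = [16] + [0] * 15

# Entropy plot values for resolutions R = 1..4.
uniform_plot = [entropy_at_resolution(uniform, R) for R in range(1, 5)]
point_plot = [entropy_at_resolution(point_mass, R) for R in range(1, 5)]
```

The uniform series gains one bit per resolution step (slope 1), while the point mass stays at zero (slope 0); real, bursty traffic falls between these two extremes.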
103BlogTimeFractal
- For a b-model (and self-similar cases), the entropy plot is linear. The slope s will tell us the bias factor.
- Lemma: For traffic generated by a b-model, the bias factor b obeys the equation $s = -b \log_2 b - (1-b) \log_2 (1-b)$.
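The lemma gives s as a function of b, so recovering b from a measured slope means inverting it numerically. The sketch below uses bisection on the interval [0.5, 1], where the right-hand side decreases from 1 to 0; the function names and the tolerance are choices made for this example.

```python
import math


def binary_entropy(b):
    """Right-hand side of the lemma: -b log2 b - (1 - b) log2 (1 - b), for 0 < b < 1."""
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)


def bias_from_slope(s, tol=1e-9):
    """Solve s = binary_entropy(b) for b in [0.5, 1] by bisection."""
    if s >= 1.0:
        return 0.5  # slope 1 corresponds to uniform traffic
    if s <= 0.0:
        return 1.0  # slope 0 corresponds to a point mass
    lo, hi = 0.5, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > s:  # entropy decreases as b moves from 0.5 toward 1
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# A slope of 0.85 gives a bias factor close to 0.72.
b_estimate = bias_from_slope(0.85)
```

Restricting the search to b >= 0.5 is harmless because the equation is symmetric in b and 1 - b; the larger root is the one reported as the bias factor.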
106Entropy Plots
- Linear plot → self-similarity
- Uniform: slope s = 1, bias b = 0.5
- Point mass: slope s = 0, bias b = 1
Michelle Malkin in-links: s = 0.85. By Lemma 1, b = 0.72.
[Figure: entropy vs. resolution]
107BlogTimeFractal Results
- Observation: Most time series of interest are self-similar.
- Observation: Bias factor is approximately 0.7; that is, more bursty than uniform (70/30 law).
Entropy plots for MichelleMalkin: in-links, b = 0.72; conversation mass, b = 0.76; number of posts, b = 0.70.
109Ali-Hasan, Adamic 2007
Expressing Social Relationships on the Blog through Links and Comments
Analyzed three blog communities:
- Dallas-Fort Worth
  - Most links are external to the community (91%)
  - Low centralization
  - Low reciprocity
- UAE
  - Fewer links external to the community
  - More centralization
  - Obvious hub structure
- Kuwait
  - Fewest links external to the community (53%)
  - Highly centralized
  - Much reciprocity
110Duarte et al. 2007
- Classified blogs into parlor, register, and broadcast.
[Figure: fraction of sessions with comments vs. total sessions, with regions labeled register, parlor, and broadcast]
111Adar et al. 2004
- Implicit Structure and the Dynamics of Blogspace
- Suggested that ideas behaved like epidemics.
- Presented iRank based on how infectious a blog
was.
(giant microbes, a site infectious in more ways
than one)