Title: STATISTICAL PROPERTIES OF THE WIKIGRAPH
arXiv:physics/0602026
A. Capocci, V. Servedio, F. Colaiori,
D. Donato, L.S. Buriol, S. Leonardi, GC
Centro E. Fermi
Wikipedia in other languages
You may read and edit articles in many different languages.
Wikipedia encyclopedia languages with over 100,000 articles:
Deutsch (German), Français (French), Italiano (Italian), 日本語 (Japanese), Nederlands (Dutch), Polski (Polish), Português (Portuguese), Svenska (Swedish)
Wikipedia encyclopedia languages with over 10,000 articles:
العربية (Arabic), Български (Bulgarian), Català (Catalan), Česky (Czech), Dansk (Danish), Eesti (Estonian), Español (Spanish), Esperanto, Galego (Galician), עברית (Hebrew), Hrvatski (Croatian), Ido, Bahasa Indonesia (Indonesian), 한국어 (Korean), Lietuvių (Lithuanian), Magyar (Hungarian), Bahasa Melayu (Malay), Norsk bokmål (Norwegian), Norsk nynorsk (Norwegian), Română (Romanian), Русский (Russian), Slovenčina (Slovak), Slovenščina (Slovenian), Српски (Serbian), Suomi (Finnish), Türkçe (Turkish), Українська (Ukrainian), 中文 (Chinese)
Wikipedia encyclopedia languages with over 1,000 articles:
Alemannisch (Alemannic), Afrikaans, Aragonés (Aragonese), Asturianu (Asturian), Azərbaycan (Azerbaijani), Bân-lâm-gú (Min Nan), Беларуская (Belarusian), Bosanski (Bosnian), Brezhoneg (Breton), Чăваш чĕлхи (Chuvash), Corsu (Corsican), Cymraeg (Welsh), Ελληνικά (Greek), Euskara (Basque), فارسی (Persian), Føroyskt (Faroese), Frysk (Western Frisian), Gaeilge (Irish), Gàidhlig (Scots Gaelic), हिन्दी (Hindi), Interlingua, Íslenska (Icelandic), Basa Jawa (Javanese), ქართული (Georgian), ಕನ್ನಡ (Kannada), Kurdî / كوردی (Kurdish), Latina (Latin), Latviešu (Latvian), Lëtzebuergesch (Luxembourgish), Limburgs (Limburgish), Македонски (Macedonian), मराठी (Marathi), Napulitana (Neapolitan), Occitan, Ирон (Ossetic), Plattdüütsch (Low Saxon), Scots, Sicilianu (Sicilian), Simple English, Shqip (Albanian), Sinugboanon (Cebuano), Srpskohrvatski/Српскохрватски (Serbo-Croatian), தமிழ் (Tamil), Tagalog, ภาษาไทย (Thai), Tatarça (Tatar), తెలుగు (Telugu), Tiếng Việt (Vietnamese), Walon (Walloon)
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopaedia Britannica.
- Among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies, the one in Britannica about three.
- On the other hand, Wikipedia articles are on average longer than those of Britannica, which implies a lower rate of errors per unit of text in Wikipedia.
- In a survey of more than 1,000 Nature authors:
  - 70% had heard of Wikipedia
  - 17% of those consulted it on a weekly basis
  - less than 10% help to update it
(Nature 438, 900-901, 2005)
Actually, things are a little bit more complicated
There is not only control by users, but also conflict of interests. As a result, it is sometimes not possible to modify 100% of the structure, since some pages are locked. One of the biggest scandals concerned the biography of journalist John Seigenthaler, which falsely accused him of being involved in the murder of President J.F. Kennedy.
Some topics and languages are more closely controlled than others. In an experiment, the Italian weekly L'espresso deliberately introduced errors into two entries:
- one in the career of football player Rui Costa (said to have been part of an Italian team in the early 90s)
- one introducing a non-existing philosopher
Obviously:
- the error about the football player was corrected after 30
- the philosopher remained in place until the experiment was published (at least two weeks)
WHY STUDY WIKIPEDIA?
- Sociological reasons: the encyclopedia collects pages written by a number of independent and heterogeneous individuals. Each of them autonomously decides on the content of the articles, the only constraint being a fixed layout. This autonomy is a common feature of content creation on the Web. The Wikipedia authors' community is formed by members whose only wish is to make available to the world the concepts and topics that they consider meaningful. In some sense, tracing the evolution of the Wikipedia subsets should mirror the development of significant trends within each linguistic community.
- Generation in time: Wikipedia provides time information associated with nodes. Moreover, it keeps the history: the times of creation and of every modification are available for each page in the dataset.
- Independence of external links: Wikipedia articles link mainly to articles in the same dataset.
- Variety of graph sizes: one graph can be collected per language, and the graph sizes vary from a few hundred pages up to half a million pages.
Summarizing:
- We have the whole history of growth available, so that we can study the evolution.
- We have an example of a social network of huge size.
- We can compare the systems produced by users of different languages, thereby measuring the effect of different cultures.
- We can use Wikipedia as a case study for the World Wide Web.
WE RECOVER A PREFERENTIAL ATTACHMENT MECHANISM FROM THE DATA. DIFFERENT LANGUAGES PRODUCE SIMILAR STRUCTURES. WE FIND A SYSTEM SIMILAR TO THE WWW EVEN IF THE MICROSCOPIC RULE OF GROWTH IS VERY DIFFERENT.
The dataset of each language is available as two self-extracting files for a MySQL database. The table cur contains the current on-line articles, whereas the table old contains all previous versions of each current article. Old versions of an article are identified by sharing the same title, not the same id. The dataset dumps are updated almost weekly, so the current graph is usually not more than a week old.
To generate a graph from the link structure of a dataset, each article is considered a node and each hyperlink between articles is a link in this graph. In the Wikipedia datasets, each webpage is a single article. An article might also contain some external links that point to pages outside the dataset. Usually Wikipedia articles have no external links, or just a few of them. Links of this kind are not considered when generating the wikigraphs, since we want to restrict the graph to pages in the set being analyzed.
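The construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the input is a hypothetical dictionary mapping each article title to the list of link targets found in its text, not the actual MySQL tables, and `build_wikigraph` is a name introduced here.

```python
def build_wikigraph(articles):
    """Build a directed graph from a dict: title -> list of link targets.

    The input format is an assumption made for this sketch; the real
    datasets are MySQL dumps that would have to be parsed first.
    """
    titles = set(articles)
    edges = set()
    for source, links in articles.items():
        for target in links:
            # Keep only links whose target is an article of the same dataset;
            # external links and links to missing pages are dropped.
            if target in titles:
                edges.add((source, target))
    return titles, edges


sample = {
    "Rome": ["Italy", "http://example.org/rome"],  # one internal, one external link
    "Italy": ["Rome", "Europe"],
    "Europe": [],
}
nodes, edges = build_wikigraph(sample)
```

Storing the edges in a set collapses repeated links between the same pair of articles into one; a list would be needed if link multiplicity mattered.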
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data because of disk space restrictions: the English dataset of June 2005 takes more than 36 GB compressed, that is about 200 GB expanded.
The most visited page was the main page for wikiEN, wikiDE, wikiFR and wikiES, while for the wikiIT and wikiPT datasets no visits were associated with the pages.
- SCC (Strongly Connected Component) includes pages that are mutually reachable by traveling on the graph.
- IN component is the region from which one can reach the SCC.
- OUT component encompasses the pages reached from the SCC.
- TENDRILS are pages reachable from the IN component that do not point to the SCC or OUT region. TENDRILS also include those pages that point to the OUT region and do not belong to any of the other defined regions.
- TUBES directly connect the IN and OUT regions.
- DISCONNECTED regions are those isolated from the rest.
The bow-tie structure, found in the WWW (Broder et al., Comp. Net. 33, 309, 2000)
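As a rough illustration of how the three main regions can be separated, the sketch below uses two graph traversals from a seed page that is assumed to lie in the SCC. The function names and the seed-based approach are choices made here, not necessarily the procedure used for the wikigraphs. Tendrils, tubes and disconnected pages are lumped together in `other`.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return the set of nodes reachable from `start` following `adj`."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(nodes, edges, seed):
    """Split nodes into SCC, IN, OUT and the rest, given a seed in the SCC."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    descendants = reachable(fwd, seed)   # pages the seed can reach
    ancestors = reachable(bwd, seed)     # pages that can reach the seed
    scc = descendants & ancestors        # mutually reachable with the seed
    in_part = ancestors - scc
    out_part = descendants - scc
    other = set(nodes) - scc - in_part - out_part
    return scc, in_part, out_part, other


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
scc, in_part, out_part, other = bow_tie(nodes, edges, seed="b")
```

In this toy graph `b` and `c` form the SCC, `a` is in IN, `d` is in OUT and `e` is disconnected.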
The measure/size of the Wikigraph for the various
languages.
The percentage of the various components of the
Wikigraph for the various languages.
The degree distribution shows fat tails that can be approximated by a power-law function of the kind
$P(k) \propto k^{-\gamma}$
where the exponent is the same for both in-degree and out-degree.
In the case of the WWW, $2 \le \gamma_{in} \le 2.1$.
In-degree (empty symbols) and out-degree (filled symbols) occurrence distributions for the Wikigraph in English and Portuguese.
As regards assortativity (measured by the average degree of the neighbours of a vertex with degree k), there is no evidence of any assortative behaviour.
The average neighbours' in-degree, computed along incoming edges, as a function of the in-degree for the English and Portuguese Wikigraphs.
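The quantity plotted here can be computed as sketched below, assuming the graph is given as a list of (source, target) pairs. For every vertex with in-degree k, the in-degrees of the vertices linking to it are averaged, and the result is then averaged over all vertices sharing the same k. The function name `knn_in` is introduced for this illustration.

```python
from collections import defaultdict


def knn_in(edges):
    """Average in-degree of in-neighbours as a function of in-degree k."""
    in_deg = defaultdict(int)
    preds = defaultdict(list)          # sources of the incoming edges
    for u, v in edges:
        in_deg[v] += 1
        preds[v].append(u)

    sums, counts = defaultdict(float), defaultdict(int)
    for v, sources in preds.items():
        k = len(sources)               # in-degree of v
        mean_nb = sum(in_deg[u] for u in sources) / k
        sums[k] += mean_nb
        counts[k] += 1
    # Average over all vertices having the same in-degree k
    return {k: sums[k] / counts[k] for k in sums}


edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "a")]
result = knn_in(edges)
```

A flat curve as a function of k is what the absence of assortative behaviour looks like in this measure.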
The PageRank distribution for wikiEN is a power-law function with exponent $\gamma \approx 2.1$. Previous measurements on webgraphs exhibit the same behaviour for the PageRank distribution. We list the number of visits of the top-ranked pages just to show that this value is not related to the PageRank values. We confirm that very little correlation was found between the link analysis characteristics and the actual number of visits.
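For reference, PageRank can be computed by power iteration as in the minimal sketch below. The damping factor 0.85 and the uniform redistribution of the rank of pages without outgoing links are conventional choices assumed here; the source does not state which settings were used.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph (illustrative sketch)."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {u: [] for u in nodes}
    for u, v in edges:
        out_links[u].append(v)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


ranks = pagerank(["a", "b", "c"], [("a", "b"), ("c", "b"), ("b", "a")])
```

The values sum to one and can be read as the fraction of time a random surfer spends on each page, which is why a comparison with actual visit counts is natural.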
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram $\Pi(k)$, which gives the number of vertices of degree k acquiring new connections at time t. This quantity is weighted by the factor N(t)/n(k,t), where N(t) is the number of vertices at time t and n(k,t) is the number of those with degree k.
We find preferential attachment for both in-degree and out-degree.
English and Portuguese Wikigraphs. Empty symbols: in-degree. Filled symbols: out-degree.
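A minimal sketch of this measurement for the in-degree is given below. It assumes the growth history is available as a chronologically ordered list of (source, target) edges, and that a vertex enters the count the first time it appears in an edge; both are simplifications introduced for the illustration. The out-degree case is analogous.

```python
from collections import Counter, defaultdict


def preferential_attachment_in(edge_stream):
    """Weighted histogram of the in-degree of vertices receiving new edges."""
    in_deg = {}                  # current in-degree of every known vertex
    n_k = Counter()              # n(k, t): number of vertices with in-degree k
    hist = defaultdict(float)    # accumulated weights, indexed by k

    def ensure(v):
        if v not in in_deg:
            in_deg[v] = 0
            n_k[0] += 1

    for source, target in edge_stream:
        ensure(source)
        ensure(target)
        k = in_deg[target]
        # Each event is weighted by N(t) / n(k, t)
        hist[k] += len(in_deg) / n_k[k]
        n_k[k] -= 1
        n_k[k + 1] += 1
        in_deg[target] = k + 1
    return dict(hist)


hist = preferential_attachment_in([("a", "b"), ("c", "b"), ("a", "c")])
```

With linear preferential attachment the resulting histogram is expected to grow proportionally to k.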
In our opinion this preferential attachment is an effective description rather than the real driving force of the phenomenon.
In other words, the linear preferential attachment can originate from a copying procedure (new vertices are introduced by copying old ones and keeping most of the edges). We could also have a sort of fitness for the various entries (but in this case one has a multidimensional series of quantities describing the importance of one page).
Apart from the interpretation, the data show a rather clear LINEAR PREFERENTIAL ATTACHMENT.
Other power laws related to the dynamics need to be explained. For example, the number of updates also follows a power law.
Each point represents the number of nodes (y axis) that were updated exactly x times.
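The points of such a plot can be obtained with two counting passes, as in this small sketch. The input, a list with one page title per recorded update, is a hypothetical format chosen for the example.

```python
from collections import Counter


def update_distribution(revisions):
    """Map x -> number of pages that were updated exactly x times."""
    updates_per_page = Counter(revisions)        # title -> number of updates
    return Counter(updates_per_page.values())    # x -> number of pages


dist = update_distribution(["A", "A", "B", "C", "C", "A"])
```

Here page A has three updates, C has two and B has one, so each value of x from 1 to 3 is found for exactly one page.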
This feature is time invariant
From these data it seems that a model in the spirit of BA could reproduce most of the features of the system.
Actually:
- This network is oriented.
- The preferential attachment in Wikipedia has a somewhat different nature. Here, most of the time, the edges are added between existing vertices, differently from the BA model. For instance, in the English version of Wikipedia a largely dominant fraction 0.883 of new edges is created between two existing pages, while a smaller fraction of edges points to or leaves a newly added vertex (0.026 and 0.091 respectively).
We introduced an evolution rule, similar to other rewiring models already considered:
- At each time step, a vertex is added to the network. It is connected to the existing vertices by M oriented edges; the direction of each edge is drawn at random:
  - with probability R1, the edge leaves the new vertex and points to an existing one, chosen with probability proportional to its in-degree;
  - with probability R2, the edge points to the new vertex, and the source vertex is chosen with probability proportional to its out-degree.
- Finally, with probability R3 = 1 - R1 - R2, the edge is added between existing vertices: the source vertex is chosen with probability proportional to its out-degree, while the destination vertex is chosen with probability proportional to its in-degree.
See for example Krapivsky, Rodgers and Redner, PRL 86, 5401 (2001)
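The growth rule can be simulated directly, as in the sketch below. Several details are assumptions made here because the rule does not fix them: the two-vertex seed graph, the fact that self-loops and multiple edges are allowed, and the function and variable names. Degree-proportional choice is implemented by keeping one list entry per edge endpoint.

```python
import random


def simulate(n_steps, M, R1, R2, seed=0):
    """Grow a directed graph with the three-way edge rule (sketch)."""
    rng = random.Random(seed)
    # Seed graph (assumption): two vertices linked in both directions
    in_deg, out_deg = [1, 1], [1, 1]
    sources = [0, 1]   # one entry per outgoing edge: sampling ~ out-degree
    targets = [1, 0]   # one entry per incoming edge: sampling ~ in-degree

    def add_edge(u, v):
        out_deg[u] += 1
        in_deg[v] += 1
        sources.append(u)
        targets.append(v)

    for _step in range(n_steps):
        new = len(in_deg)
        in_deg.append(0)
        out_deg.append(0)
        for _edge in range(M):
            r = rng.random()
            if r < R1:
                # Edge leaves the new vertex, target chosen ~ in-degree
                add_edge(new, rng.choice(targets))
            elif r < R1 + R2:
                # Edge points to the new vertex, source chosen ~ out-degree
                add_edge(rng.choice(sources), new)
            else:
                # Edge between existing vertices (self-loops not excluded)
                add_edge(rng.choice(sources), rng.choice(targets))
    return in_deg, out_deg


in_deg, out_deg = simulate(n_steps=1000, M=10, R1=0.026, R2=0.091)
```

Histograms of `in_deg` and `out_deg` from a long enough run can then be compared with the measured degree distributions.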
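The model can be solved analytically.
For the model we can use the empirical values already measured for the English version of the Wikigraph: $R_1 = 0.026$, $R_2 = 0.091$, $R_3 = 0.883$.
$P(k_{in}) \sim k_{in}^{-\gamma_{in}}$, with $\gamma_{in} = 1 + 1/(1 - R_2)$
$P(k_{out}) \sim k_{out}^{-\gamma_{out}}$, with $\gamma_{out} = 1 + 1/(1 - R_1)$
$\gamma_{in} \approx 2.100$, $\gamma_{out} \approx 2.027$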
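The quoted numerical values can be checked with a short calculation. It reads the exponent formula as 1 + 1/(1 - R), which is the reading that reproduces the two numbers given above.

```python
def exponents(R1, R2):
    """Degree exponents of the model, read as gamma = 1 + 1/(1 - R)."""
    gamma_in = 1.0 + 1.0 / (1.0 - R2)
    gamma_out = 1.0 + 1.0 / (1.0 - R1)
    return gamma_in, gamma_out


g_in, g_out = exponents(R1=0.026, R2=0.091)
print(f"{g_in:.3f} {g_out:.3f}")  # prints "2.100 2.027"
```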
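The model can be solved analytically:
$K_{nn}^{in}(k_{in}) \sim M N^{1-R_1} \, R_1 R_2 / R_3$ for $R_3 \neq 0$
$K_{nn}^{in}(k_{in}) \sim M R_1 R_2 \ln N$ for $R_3 = 0$
In both cases the value is constant, that is, independent of $k_{in}$.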
The value of the constant also depends on the initial conditions. The two lines refer to two realizations of the model; in one of them the first 0.5% of the vertices has been removed.
- We have a structure that resembles the bow-tie of the WWW.
- We have a power-law decay for the degree distributions and also a power-law decay for the number of updates of a page.
- Preferential attachment in the rewiring seems to be the driving force in the evolution of the system.
- The microscopic structure of rewiring is very different from that of the WWW.
- In principle a user can change any series of edges and add as many pages as wanted. Still, most of the quantities are similar.
The finding that the PageRank of a page is not related to its number of visits opens a very interesting scenario for further research. Since, by definition, PageRank should give us the visit time of a page, and since it is actually completely independent of the number of visits, we wonder whether PageRank is a good measure of the authoritativeness of pages in wikigraphs, and which modifications should be introduced in order to tune its performance.