1
STATISTICAL PROPERTIES OF THE WIKIGRAPH
arXiv:physics/0602026
  • A. Capocci, V. Servedio, F. Colaiori,
  • D. Donato, L.S. Buriol, S. Leonardi, GC

Centro E. Fermi
2
  • Introduction

3
  • Introduction

Wikipedia in other languages
You may read and edit articles in many different
languages:
  • Wikipedia encyclopedia languages with over
    100,000 articles: Deutsch (German), Français
    (French), Italiano (Italian), 日本語 (Japanese),
    Nederlands (Dutch), Polski (Polish), Português
    (Portuguese), Svenska (Swedish)
  • Wikipedia encyclopedia languages with over
    10,000 articles: العربية (Arabic), Български
    (Bulgarian), Català (Catalan), Česky (Czech),
    Dansk (Danish), Eesti (Estonian), Español
    (Spanish), Esperanto, Galego (Galician), עברית
    (Hebrew), Hrvatski (Croatian), Ido, Bahasa
    Indonesia (Indonesian), 한국어 (Korean), Lietuvių
    (Lithuanian), Magyar (Hungarian), Bahasa Melayu
    (Malay), Norsk bokmål (Norwegian), Norsk
    nynorsk (Norwegian), Română (Romanian), Русский
    (Russian), Slovenčina (Slovak), Slovenščina
    (Slovenian), Српски (Serbian), Suomi (Finnish),
    Türkçe (Turkish), Українська (Ukrainian), 中文
    (Chinese)
  • Wikipedia encyclopedia languages with over
    1,000 articles: Alemannisch (Alemannic),
    Afrikaans, Aragonés (Aragonese), Asturianu
    (Asturian), Azərbaycan (Azerbaijani),
    Bân-lâm-gú (Min Nan), Беларуская (Belarusian),
    Bosanski (Bosnian), Brezhoneg (Breton), Чăваш
    чĕлхи (Chuvash), Corsu (Corsican), Cymraeg
    (Welsh), Ελληνικά (Greek), Euskara (Basque),
    فارسی (Persian), Føroyskt (Faroese), Frysk
    (Western Frisian), Gaeilge (Irish), Gàidhlig
    (Scots Gaelic), हिन्दी (Hindi), Interlingua,
    Íslenska (Icelandic), Basa Jawa (Javanese),
    ქართული (Georgian), ಕನ್ನಡ (Kannada), Kurdî /
    كوردی (Kurdish), Latina (Latin), Latviešu
    (Latvian), Lëtzebuergesch (Luxembourgish),
    Limburgs (Limburgish), Македонски (Macedonian),
    मराठी (Marathi), Napulitana (Neapolitan),
    Occitan, Ирон (Ossetic), Plattdüütsch (Low
    Saxon), Scots, Sicilianu (Sicilian), Simple
    English, Shqip (Albanian), Sinugboanon
    (Cebuano), Srpskohrvatski/Српскохрватски
    (Serbo-Croatian), தமிழ் (Tamil), Tagalog,
    ภาษาไทย (Thai), Tatarça (Tatar), తెలుగు
    (Telugu), Tiếng Việt (Vietnamese), Walon
    (Walloon)
4
  • Introduction

A Nature investigation aimed to find out whether
Wikipedia is an authoritative source of
information compared with established sources
such as Encyclopedia Britannica.
  • Among the 42 entries tested, the difference in
    accuracy was not particularly great:
  • the average science entry in Wikipedia contained
    around four inaccuracies,
  • the one in Britannica, about three.
  • On the other hand, the articles on Wikipedia are
    longer on average than those of Britannica, which
    implies a lower rate of errors per unit of text
    in Wikipedia.
  • In a survey of more than 1,000 Nature authors:
  • 70% had heard of Wikipedia,
  • 17% of those consulted it on a weekly basis,
  • less than 10% help to update it.

(Nature 438, 900-901, 2005)
5
  • Introduction

6
  • Introduction

Actually, things are a little bit more complicated
7
  • Introduction

There is not only control by users, but also
conflicts of interest. As a result, it is
sometimes not possible to modify 100% of the
structure, since some pages are locked. One of
the biggest scandals concerned the biography of
the journalist John Seigenthaler, in which he was
falsely accused of being involved in the murder
of President J.F. Kennedy.
Some topics and languages are more tightly
controlled than others. In an experiment, the
Italian newspaper L'Espresso deliberately
introduced errors into two entries:
  • one in the career of the football player Rui
    Costa (stated to have been part of an Italian
    team in the early 90s),
  • one introducing a non-existent philosopher.
  • As one might expect:
  • the error about the football player was corrected
    after 30
  • the philosopher remained in place until the
    experiment was published (at least two weeks).


8
  • Introduction

WHY STUDY WIKIPEDIA?
  • Sociological reasons: the encyclopedia collects
    pages written by a number of independent and
    heterogeneous individuals. Each of them
    autonomously decides on the content of the
    articles, with the only constraint being a
    predefined layout. This autonomy is a common
    feature of content creation on the Web. The
    Wikipedia author community is formed by members
    whose only wish is to make available to the
    world concepts and topics that they consider
    meaningful. In some sense, tracing the evolution
    of the Wikipedia subsets should mirror the
    development of significant trends within each
    linguistic community.
  • Generation time: Wikipedia provides time
    information associated with nodes. Moreover, it
    keeps old information: time information for
    the creation and the modifications of each page
    in the dataset.
  • Independence from external links: Wikipedia
    articles link mainly to articles in the same
    dataset.
  • Variety of graph sizes: one graph can be
    collected per language, and the graph sizes vary
    from a few hundred pages up to half a million
    pages.

9
  • Introduction
  • Summarizing:
  • We have the whole history of growth available,
    so that we can study the evolution.
  • We have an example of a social network of huge
    size.
  • We can compare the systems produced by users of
    different languages, thereby measuring the
    effect of different cultures.
  • We can study Wikipedia as a case study for the
    World Wide Web.

WE RECOVER A PREFERENTIAL ATTACHMENT MECHANISM
FROM THE DATA. DIFFERENT LANGUAGES PRODUCE
SIMILAR STRUCTURES. WE FIND A SYSTEM SIMILAR TO
THE WWW EVEN IF THE MICROSCOPIC RULE OF GROWTH IS
VERY DIFFERENT.
10
  • Data

The datasets of each language are available as
two self-extracting files for a MySQL database.
The table cur contains the current on-line
articles, whereas the table old contains all
previous versions of each current article. Old
versions of an article are identified by having
the same title, not the same id. The dataset
dumps are updated almost weekly, so the current
graph is usually not more than a week old. To
generate a graph from the link structure of a
dataset, each article is considered a node and
each hyperlink between articles is a link in this
graph. In the Wikipedia datasets, each webpage is
a single article. An article might also contain
some external links that point to pages outside
the dataset. Usually Wikipedia articles have no
external links, or just a few of them. These kinds
of links are not considered when generating the
wikigraphs, since we want to restrict the graph
to pages in the set being analyzed.
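The graph construction described above could be sketched as follows. This is only a minimal illustration, not the procedure actually used for the wikigraphs: the input format (a dict from title to raw wiki text), the link pattern and the function name build_wikigraph are assumptions, and links to pages that are not in the dataset are simply dropped.

```python
import re

# Assumed wiki-markup convention: internal links look like [[Target]],
# [[Target|label]] or [[Target#Section]]; external links use single brackets.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def build_wikigraph(articles):
    """Map each article title to the set of dataset articles it links to.

    `articles` is a hypothetical dict {title: raw wiki text}.
    """
    graph = {title: set() for title in articles}
    for title, text in articles.items():
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            # Keep only links whose target is a page of the analyzed set.
            if target in articles:
                graph[title].add(target)
    return graph


articles = {
    "Rome": "Capital of [[Italy]]. See [http://example.org a site] and [[Atlantis]].",
    "Italy": "Its capital is [[Rome|the city]]; see also [[Rome#History]].",
}
graph = build_wikigraph(articles)
```

In this toy input the external link and the link to the missing page "Atlantis" produce no edge, so only the two internal links survive.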
11
  • Data

We generated six wikigraphs, wikiEN, wikiDE,
wikiFR, wikiES, wikiIT and wikiPT, from
the English, German, French, Spanish, Italian and
Portuguese datasets, respectively. The graphs
were obtained from an old dump of June 13, 2004.
We are not using the current data due to disk
space restrictions. The English dataset of June
2005 takes more than 36 GB compressed, that is
about 200 GB expanded.
The most visited page was the main
page for wikiEN, wikiDE, wikiFR and wikiES,
while for the datasets wikiIT and wikiPT
there were no visits associated with the pages.
12
  • Topology
  • SCC (Strongly Connected Component) includes
    pages that are mutually reachable by traveling on
    the graph.
  • IN component is the region from which one can
    reach the SCC.
  • OUT component encompasses the pages reached from
    the SCC.
  • TENDRILS are pages reachable from the IN
    component and not pointing to the SCC or OUT
    region. TENDRILS also include those pages that
    point to the OUT region without belonging to any
    of the other defined regions.
  • TUBES directly connect the IN and OUT regions.
  • DISCONNECTED regions are those isolated from the
    rest.

The bow-tie structure, found in the WWW (Broder
et al., Comp. Net. 33, 309, 2000)
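A possible way to compute these regions is sketched below, assuming the graph is stored as a dict from page to the set of pages it links to. The function names are hypothetical, the SCC is taken to be the largest strongly connected component, and the component search is a simple quadratic one chosen for brevity rather than speed.

```python
def reachable(adj, sources):
    """Set of nodes reachable from `sources` along `adj` (iterative DFS)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(adj):
    """Adjacency of the graph with every edge reversed."""
    rev = {node: set() for node in adj}
    for src, targets in adj.items():
        for dst in targets:
            rev.setdefault(dst, set()).add(src)
    return rev


def largest_scc(adj, rev):
    """Largest SCC: forward-reachable set intersected with backward-reachable set."""
    unassigned = set(adj) | set(rev)
    best = set()
    while unassigned:
        node = next(iter(unassigned))
        comp = reachable(adj, [node]) & reachable(rev, [node])
        unassigned -= comp
        if len(comp) > len(best):
            best = comp
    return best


def bow_tie(adj):
    rev = reverse(adj)
    nodes = set(adj) | set(rev)
    scc = largest_scc(adj, rev)
    out = reachable(adj, scc) - scc      # pages reached from the SCC
    in_ = reachable(rev, scc) - scc      # pages from which the SCC is reached
    rest = nodes - scc - in_ - out
    from_in = reachable(adj, in_)
    to_out = reachable(rev, out)
    tubes = rest & from_in & to_out      # connect IN to OUT, bypassing the SCC
    tendrils = (rest & (from_in | to_out)) - tubes
    # Whatever remains is treated here as disconnected (a simplification).
    disconnected = rest - tubes - tendrils
    return {"SCC": scc, "IN": in_, "OUT": out, "TENDRILS": tendrils,
            "TUBES": tubes, "DISCONNECTED": disconnected}


toy = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "o"},   # a, b, c form the SCC
    "i": {"a", "t", "u"}, "t": set(), "u": {"o"},
    "o": set(), "p": {"o"}, "x": {"y"}, "y": set(),
}
regions = bow_tie(toy)
```

The percentages of the various components then follow by dividing the size of each set by the total number of pages.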
13
  • Topology

The measure/size of the Wikigraph for the various
languages.
The percentage of the various components of the
Wikigraph for the various languages.
14
  • Topology

The degree distribution shows fat tails that can
be approximated by a power-law function of the
kind $P(k) \sim k^{-\gamma}$, where the exponent is
the same for both in-degree and out-degree.
In the case of the WWW, $2 \le \gamma_{in} \le 2.1$.
In-degree (empty symbols) and out-degree (filled
symbols) occurrence distributions for the
Wikigraph in English and Portuguese.
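As an illustration of how such occurrence distributions could be obtained, the sketch below counts how many pages have each in-degree and out-degree, and gives a crude estimate of the exponent from a least-squares line in log-log scale. The function names and the dict-of-sets graph format are assumptions, and the log-log fit is only a rough stand-in for a proper fit of the tail.

```python
import math
from collections import Counter


def degree_occurrences(graph):
    """Return (in_occ, out_occ): how many vertices have each in-/out-degree."""
    in_deg = Counter()
    for targets in graph.values():
        for dst in targets:
            in_deg[dst] += 1
    nodes = set(graph) | set(in_deg)
    in_occ = Counter(in_deg.get(node, 0) for node in nodes)
    out_occ = Counter(len(graph.get(node, ())) for node in nodes)
    return in_occ, out_occ


def loglog_exponent(occurrences):
    """Crude estimate of gamma: minus the least-squares slope of
    log(occurrences) versus log(k), ignoring k = 0."""
    points = [(math.log(k), math.log(n))
              for k, n in occurrences.items() if k > 0 and n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


toy = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
in_occ, out_occ = degree_occurrences(toy)
```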
15
  • Topology

As regards the assortativity (as measured by the
average degree of the neighbours of a vertex with
degree k), there is no evidence of any assortative
behaviour.
The average neighbours' in-degree, computed along
incoming edges, as a function of the in-degree,
for the English and Portuguese Wikigraphs.
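The quantity plotted here could be computed along the following lines. This is a sketch under assumptions: the graph is a dict from page to the set of pages it links to, the name knn_in is hypothetical, and for each vertex the average is taken over the sources of its incoming edges before averaging over all vertices with the same in-degree.

```python
from collections import defaultdict


def knn_in(graph):
    """Average in-degree of in-neighbours, as a function of in-degree k."""
    in_neighbours = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            in_neighbours[dst].add(src)

    def in_degree(node):
        return len(in_neighbours.get(node, ()))

    per_k = defaultdict(list)
    for node, sources in in_neighbours.items():
        # Mean in-degree of the pages that link to `node`.
        mean = sum(in_degree(src) for src in sources) / len(sources)
        per_k[len(sources)].append(mean)
    return {k: sum(values) / len(values) for k, values in per_k.items()}


toy = {"a": {"c"}, "b": {"c"}, "c": {"d"}, "d": set()}
curve = knn_in(toy)
```

A flat curve as a function of k corresponds to the absence of assortative behaviour mentioned above.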
16
  • Topology

The pagerank distribution for wikiEN is a power
law with exponent $\gamma \approx 2.1$. Previous
measurements on webgraphs exhibit the same
behaviour for the pagerank distribution. We list
the number of visits of the top-ranked pages just
to show that this value is not related to the
pagerank values. We confirm that very little
correlation was found between the link-analysis
characteristics and the actual number of visits.
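For reference, pagerank can be computed by a simple power iteration, sketched below. The damping factor 0.85, the number of iterations and the uniform treatment of pages without outgoing links are conventional choices assumed here; the slides do not state which settings were used.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration pagerank on a dict {page: set of linked pages}."""
    nodes = sorted(set(graph) | {dst for targets in graph.values() for dst in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            targets = graph.get(src, ())
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


toy = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
ranks = pagerank(toy)
```

The resulting values can then be compared with the recorded number of visits per page, for example by ranking pages under both measures.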
17
  • Dynamics

Given the history of growth, one can verify the
hypothesis of preferential attachment. This is
done by means of the histogram P(k), which gives
the number of vertices (whose degree is k)
acquiring new connections at time t. This quantity
is weighted by the factor N(t)/n(k,t), where N(t)
is the number of vertices at time t and n(k,t) is
the number of vertices with degree k at time t.
We find preferential attachment for both in-degree
and out-degree.
English and Portuguese. Empty symbols: in-degree.
Filled symbols: out-degree.
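The weighted histogram could be accumulated as in the sketch below, shown for the in-degree only. Several things are assumptions made for illustration: the input is a chronologically ordered list of (source, destination) edges, a vertex is taken to exist from its first appearance in an edge, and the name attachment_histogram is hypothetical.

```python
from collections import Counter, defaultdict


def attachment_histogram(timed_edges):
    """For each in-degree k, sum N(t)/n(k,t) over the edges received
    by vertices that had in-degree k just before the edge arrived."""
    in_deg = {}        # current in-degree of every vertex seen so far
    n_k = Counter()    # n(k, t): number of vertices with in-degree k
    hist = defaultdict(float)
    for src, dst in timed_edges:
        for vertex in (src, dst):
            if vertex not in in_deg:
                in_deg[vertex] = 0
                n_k[0] += 1
        k = in_deg[dst]
        hist[k] += len(in_deg) / n_k[k]   # weight N(t)/n(k,t)
        n_k[k] -= 1
        n_k[k + 1] += 1
        in_deg[dst] = k + 1
    return dict(hist)


history = [("a", "b"), ("c", "b"), ("a", "c")]
hist = attachment_histogram(history)
```

A histogram growing linearly with k would indicate linear preferential attachment; the same bookkeeping applied to sources and out-degrees gives the out-degree curve.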
18
  • Dynamics

In our opinion this preferential attachment is an
effective description rather than the real
driving force of the phenomenon.
In other words, the linear preferential attachment
can originate from a copying procedure (new
vertices are introduced by copying old ones and
keeping most of the edges). We could also have a
sort of fitness for the various entries (but in
this case one has a multidimensional series of
quantities describing the importance of one
page).
Apart from the interpretation, the data show a
rather clear LINEAR PREFERENTIAL ATTACHMENT.
19
  • Dynamics

Other power laws related to the dynamics need to
be explained. For example, the number of updates
also follows a power law.
Each point represents the number of nodes (y axis)
that were updated exactly x times.
20
  • Dynamics

This feature is time invariant
21
  • Modelling

From these data it seems that a model in the
spirit of BA (Barabási-Albert) could reproduce
most of the features of the system.
  • Actually:
  • This network is directed.
  • The preferential attachment in Wikipedia has a
    somewhat different nature. Here, most of the
    time, the edges are added between existing
    vertices, unlike in the BA model. For
    instance, in the English version of Wikipedia a
    largely dominant fraction 0.883 of new edges is
    created between two existing pages, while a
    smaller fraction of edges points to or leaves a
    newly added vertex (0.026 and 0.091 respectively).
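Fractions of this kind could be measured from the edit history roughly as follows. The convention used is an assumption made for illustration and may differ from the one behind the numbers above: edges are given in chronological order, and a page counts as newly added the first time it appears as an endpoint.

```python
from collections import Counter


def classify_new_edges(timed_edges):
    """Fractions of edges between existing pages, leaving a new page,
    or pointing to a new page (hypothetical convention, see lead-in)."""
    seen = set()
    counts = Counter()
    for src, dst in timed_edges:
        if src not in seen:
            counts["leaves_new_vertex"] += 1
        elif dst not in seen:
            counts["points_to_new_vertex"] += 1
        else:
            counts["between_existing"] += 1
        seen.update((src, dst))
    total = sum(counts.values())
    kinds = ("between_existing", "leaves_new_vertex", "points_to_new_vertex")
    return {kind: counts[kind] / total for kind in kinds}


history = [("a", "b"), ("b", "a"), ("c", "a"), ("a", "d")]
fractions = classify_new_edges(history)
```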

22
  • Modelling
  • We introduced an evolution rule, similar to other
    rewiring models already considered:
  • At each time step, a vertex is added to the
    network. It is connected to the existing
    vertices by M directed edges; the direction of
    each edge is drawn at random:
  • with probability R1, the edge leaves the new
    vertex, pointing to an existing one chosen with
    probability proportional to its in-degree;
  • with probability R2, the edge points to the new
    vertex, and the source vertex is chosen with
    probability proportional to its out-degree;
  • finally, with probability R3 = 1 - R1 - R2, the
    edge is added between existing vertices: the
    source vertex is chosen with probability
    proportional to its out-degree, while the
    destination vertex is chosen with probability
    proportional to its in-degree.

See for example Krapivsky, Rodgers and Redner,
PRL 86, 5401 (2001)
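The evolution rule can be simulated in a few lines, as in the sketch below. Some details are not fixed by the slide and are assumptions here: the seed graph (two vertices linked in both directions), the fact that degrees are updated only at the end of each time step, and the acceptance of self-loops and repeated edges. The function name is hypothetical.

```python
import random


def grow_wikigraph_model(n_vertices, m, r1, r2, seed=0):
    """Return the edge list produced by the growth rule with parameters
    M = m, R1 = r1, R2 = r2 and R3 = 1 - r1 - r2."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 0)]   # assumed seed graph
    # A vertex appears once per outgoing (incoming) edge, so a uniform draw
    # from these lists is a draw proportional to out-degree (in-degree).
    out_stubs = [0, 1]
    in_stubs = [1, 0]
    for new in range(2, n_vertices):
        batch = []
        for _ in range(m):
            u = rng.random()
            if u < r1:
                batch.append((new, rng.choice(in_stubs)))
            elif u < r1 + r2:
                batch.append((rng.choice(out_stubs), new))
            else:
                batch.append((rng.choice(out_stubs), rng.choice(in_stubs)))
        # Degrees are updated only after all M edges of the step are drawn.
        for src, dst in batch:
            edges.append((src, dst))
            out_stubs.append(src)
            in_stubs.append(dst)
    return edges


model_edges = grow_wikigraph_model(200, 3, 0.026, 0.091, seed=1)
```

Keeping one list entry per edge endpoint avoids recomputing degree-proportional probabilities at every step.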
23
  • Modelling

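The model can be solved analytically.
We can use for the model the empirical values
R1 = 0.026, R2 = 0.091, R3 = 0.883, already
measured for the English version of the Wikigraph:
$P(k_{in}) \sim k_{in}^{-\gamma_{in}}$, with $\gamma_{in} = 1 + 1/(1 - R_2)$
$P(k_{out}) \sim k_{out}^{-\gamma_{out}}$, with $\gamma_{out} = 1 + 1/(1 - R_1)$
$\gamma_{in} \approx 2.100$, $\gamma_{out} \approx 2.027$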
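The quoted exponents can be checked numerically, reading the formulas on this slide as gamma_in = 1 + 1/(1 - R2) and gamma_out = 1 + 1/(1 - R1). That reading is an assumption, but it does reproduce the values 2.100 and 2.027. The function name is hypothetical.

```python
def model_exponents(r1, r2):
    """Degree exponents of the model, under the reading
    gamma_in = 1 + 1/(1 - R2) and gamma_out = 1 + 1/(1 - R1)."""
    gamma_in = 1.0 + 1.0 / (1.0 - r2)
    gamma_out = 1.0 + 1.0 / (1.0 - r1)
    return gamma_in, gamma_out


R1, R2 = 0.026, 0.091
R3 = 1.0 - R1 - R2          # fraction of edges between existing pages
gamma_in, gamma_out = model_exponents(R1, R2)
print(f"R3 = {R3:.3f}, gamma_in = {gamma_in:.3f}, gamma_out = {gamma_out:.3f}")
```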
24
  • Modelling

The model can be solved analytically.
$K_{nn}^{in}(k_{in}) \sim M N^{1-R_1} R_1 R_2 / R_3$ for $R_3 \neq 0$
$K_{nn}^{in}(k_{in}) \sim M R_1 R_2 \ln N$ for $R_3 = 0$
In both cases it is constant, that is, independent
of $k_{in}$.
The value of the constant also depends on the
initial conditions. The two lines refer to two
realizations of the model, in one of which the
first 0.5% of the vertices has been removed.
25
  • Conclusions
  • We have a structure that resembles the bow-tie
    of the WWW.
  • We have a power-law decay for the degree
    distributions and also a power-law decay for the
    number of updates of a page.
  • Preferential attachment in the rewiring seems to
    be the driving force in the evolution of the
    system.
  • The microscopic structure of rewiring is very
    different from that of the WWW: in principle a
    user can change any series of edges and add as
    many pages as wanted. Still, most of the
    quantities are similar.

26
  • Conclusions

The finding that the pagerank of the pages is
not related to the number of visits opens a very
interesting scenario for further research.
Since, by definition, pagerank should give us the
visit time of a page, and since it is actually
completely independent of the number of visits, we
wonder whether pagerank is a good measure of the
authoritativeness of the pages in wikigraphs and
which modifications should be introduced in order
to tune its performance.