Studying users behavior in chat rooms

1
Studying users behavior in chat rooms
  • DANSS
  • January 25, 2004
  • Michael Rochkind

2
Agenda
  • Motivation
  • Project goals
  • What was done
  • Results
  • Conclusions

3
Motivation
  • Simulations of interactive end users are needed to
    evaluate algorithms and system designs (e.g.,
    algorithms for estimating multicast group size)
  • Real data is difficult to obtain (for both technical
    and administrative reasons)
  • Most researchers use a trace collected for the audio
    multicast of IETF conference talks in 1996

4
Problems with the trace
  • An entire research field is based on a single
    trace
  • The trace is quite old (from 1996)
  • It was collected from one specific type of service
    (audio conference). The exact nature of the users is
    unknown. Their behavior is not necessarily the same
    as in other applications.
  • It is impossible to validate the data or to collect
    new data
  • Relatively little activity by members
  • The percentage of spurious joins/leaves is very high

5
Statistical analysis of the trace
  • Different researchers obtained different statistical
    models for various parameters.
  • Ammar and Almeroth (the original trace creators)
    obtained an exponential model for most parameters
    and a Zipf distribution for long session stay time.
  • Aluf, Altman, and Nain recently obtained from the
    same long trace a lognormal distribution for both
    inter-arrival times and stay times. For a short
    multicast session they obtained Weibull
    distributions for both inter-arrival and stay
    times.
  • A uniform (spatial) distribution of users was assumed

6
Project goals
  • To find a publicly available system that
    reasonably approximates multicast user behavior.
  • To develop data retrieval tools that
    can be run by anyone, anytime.
  • To analyze the collected data

7
Parameters of interest
  • Inter-arrival time
  • Session duration (on-time)
  • Number of logged in users (group size)
  • User activity (messages, bytes)
  • Geographical distribution of users
  • Lifespan of multicast event (for short events)
  • Comparison with the famous trace

8
First try - message boards (Yahoo)
  • The notion of a user session is difficult to define. Many
    users send just one message.
  • Only active users (writers) can be seen
  • A lot of information is missing (about 50)
  • Activity peaks when outstanding events happen

9
Chat rooms
  • The model is similar to a multicast group
  • Users explicitly join the room and leave it
  • Join/leave time and stay time are well-defined.
  • Every message sent to the room is received by all
    room members

10
IRC - Internet Relay Chat protocol
  • Runs over TCP/IP
  • Text-based teleconferencing
  • Client-server model
  • Can run in a distributed fashion
  • Five big networks with many tens of thousands of
    users and thousands of channels (rooms)

11
IRC Servers
  • Form the backbone of an IRC network
  • Connected together without cycles (in the form
    of a spanning tree)
  • Handle client connections
  • Each server knows about all other servers and all
    clients.

[Figure: example IRC network with servers S1 to S6 and clients C1 to C4]
12
IRC clients
  • An IRC client is anything connected to an IRC server
    that is not another IRC server.
  • Any TCP-enabled device can be an IRC client
  • Clients are distinguished by a unique nickname
  • Each IRC server has the following info about each
    IRC client:
      • Nickname
      • Real name of the host where the client is running
      • Username of the client on that host
      • IRC server to which the client is connected

13
IRC Channels
  • Equivalent to the term "chat room"
  • A named group of one or more users who all
    receive messages addressed to that channel.
  • Created when the first user joins the channel
  • Ceases to exist when the last user leaves it
  • In case of a network split, the channel on each side
    has only those clients connected to the servers
    on that side. After the network
    reconnects, the two sides of the channel are joined again.
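
The channel lifecycle in the bullets above (created on first join, gone after the last leave) can be illustrated with a toy registry. This is only a sketch: the `ChannelRegistry` class and its method names are hypothetical and do not come from any IRC server implementation.

```python
class ChannelRegistry:
    """Toy model of the IRC channel lifecycle (hypothetical names, not a real server API)."""

    def __init__(self):
        self.channels = {}  # channel name -> set of member nicknames

    def join(self, channel, nick):
        # The channel is created implicitly when the first user joins.
        self.channels.setdefault(channel, set()).add(nick)

    def leave(self, channel, nick):
        members = self.channels.get(channel)
        if members is None:
            return
        members.discard(nick)
        # The channel ceases to exist when the last user leaves it.
        if not members:
            del self.channels[channel]


registry = ChannelRegistry()
registry.join("X", "C1")
registry.join("X", "C2")
registry.leave("X", "C1")
still_exists = "X" in registry.channels   # one member left, channel remains
registry.leave("X", "C2")
gone = "X" not in registry.channels       # last member left, channel removed
```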

14
IRC network example
[Figure: example IRC network with servers S1 to S6 and clients C1 to C4]
15
IRC message sending
[Figure: a message travelling through the example network of servers S1 to S6 and clients C1 to C4]
16
IRC: a new member joins a channel
  • Channel X with members C1, C2, C3
  • Client C4 joins the channel X

[Figure: C4 sends "1. Join X" and receives "2. names c1, c2, c3"; a "join c4" notification travels over the links between servers S1 to S6 and reaches clients C1, C2, C3]
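
A join notification spreading over a cycle-free server network can be sketched as a breadth-first walk. The exact links of the slide's diagram are not recoverable from this transcript, so the topology below, the client-to-server mapping, and the `propagate_join` helper are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical topology: a tree of servers (5 links, 6 servers) and an assumed
# mapping of each client to the server it is connected to.
SERVER_LINKS = {
    "S1": ["S2", "S5"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S2", "S6"],
    "S4": ["S2"],
    "S5": ["S1"],
    "S6": ["S3"],
}
CLIENT_SERVER = {"C1": "S5", "C2": "S2", "C3": "S4", "C4": "S6"}


def propagate_join(new_client, members):
    """Return (servers in the order they are reached, members notified of the join)."""
    start = CLIENT_SERVER[new_client]
    visited = [start]
    queue = deque([start])
    while queue:
        server = queue.popleft()
        for neighbor in SERVER_LINKS[server]:
            # Skip the link we arrived on; with no cycles, each server is reached once.
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    notified = sorted(m for m in members if CLIENT_SERVER[m] in visited)
    return visited, notified


servers, notified = propagate_join("C4", {"C1", "C2", "C3"})
```

Because the servers form a spanning tree, the notification needs no duplicate suppression beyond not sending it back over the link it came from.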
17
IRC Channel Monitoring
  • Monitoring client written in Perl, running under
    cron
  • We randomly chose 3 channels from the group of
    all channels with more than 100 users: israel,
    canada, bosnia
  • Channel activity data was collected over a period
    of about 6 weeks.

18
Log file format
  • START
  • EXIT
  • JOIN
  • PART/QUIT/KICK
  • PUBLIC
  • NICK
  • NAMES

19
1053586971 START
1053587032 JOIN wponiw IL
1053587032 NAMES wponiw Teo_ i-NA mr_shark _kNibAL_ kaye_22 Old-Man CHA_555 klent Leila19f Dan kalanko1 Manifa21f jennider1 eu_sunt mangko18 hotguy holly20f sad_beaut swimgirl ghazde swt_guy pseudonym bing_23 topgirl23 sexYica creatza sergio9 ZaRa glance cookie aileen Ugly-GirL AFNAN EclipseM laurra-f garden cai applej SHUNSY fatcock kikelph mhaelee16 aGaTa Ercko lonebabe shellaine juulia priti2 HuntI2ess
1053587032 NAMES gienah Amanda Jamali lishat18 cute_ashf jhen Horbit Sana18 AloneMan3 Errikka ext-ex Maysmile ynet02 poem_37M ann3 jelle love_less dreeve18 indai adze LiWeiYi TokyoBoy blossom dummee man__ marichu earp danone jackdaw faraz ANGELA25 boby27 leah_ jossie shyrgil jade-17 kian arnulpo ally16 FiNG Carmina42 bangd sohail Janine33 anne--- joyce22 LUIE_M Travioli corn HOMBREJ2 sexybabes spyk2000 barbi3
1053587032 NAMES tumbleWED Gaby3 chynna babyTH lenjie jherome Certified dj_france jane36 micay shah goerge24 bluediamo master_po Jypsy bassma Bobson Fil24f dimple2 _THERE_ AloneGirL Naked_f shark_nyk morena23 Danniel_m Arwen_ ofw_park jimbern m40usa restie @PacZzZzZz blackstud davis He11razor MultiMind mater Fearless Adnan_pk Ermya Helena BrainDead CStrixAW wooden birkof Cute_Girl Lisa_-- Megaframe barbara-
1053587032 NAMES Simple Loren23 Diana27 Cozzo NateDogg legendh Angel19 Mariah19 fedfed SUNSEEKER PRONET7 bestofmi D0gGi3 Don_Juan MrNylons teapot SkiPerZ Br0Th4 Linutech ShowerMia JenJen Mariahhh optimist @X
1053587032 JOIN D-A-D-I IN
1053587045 JOIN sydneyguy AU
1053587047 PUBLIC Certified 17 US
1053587053 PUBLIC Certified 13 US
1053587059 JOIN Mckay28 MT
1053587063 NICK CHA_555 zHTe
1053587068 PUBLIC Certified 31 US
1053587076 PART zHTe
1053587080 JOIN villain PH
1053587082 JOIN cryn PH
1053587095 JOIN staticx US
1053587098 PUBLIC Certified 31 US
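
A record in this log appears to be a Unix timestamp, an event name, and event-specific arguments. The sketch below splits records on that assumption; the field layout is inferred from the sample alone, and `parse_record` is a hypothetical helper, not part of the project's scripts.

```python
from datetime import datetime, timezone


def parse_record(line):
    """Split one log record into (timestamp, event, arguments)."""
    fields = line.split()
    return int(fields[0]), fields[1], fields[2:]


# A few records copied from the sample log.
sample = """\
1053586971 START
1053587032 JOIN wponiw IL
1053587047 PUBLIC Certified 17 US
1053587063 NICK CHA_555 zHTe
1053587076 PART zHTe
"""

records = [parse_record(line) for line in sample.splitlines() if line.strip()]
first_timestamp = records[0][0]
# Interpreting the first field as seconds since the Unix epoch (an assumption).
started = datetime.fromtimestamp(first_timestamp, tz=timezone.utc)
```

Read as epoch seconds, the first timestamp falls in May 2003, which is consistent with the date of the presentation.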
20
Inter-arrival time
21
Inter-Arrival distribution bosnia
[Two histograms: occurrences vs. time (in sec)]
22
Inter-Arrival distribution israel
[Two histograms: occurrences vs. time (in sec)]
23
Inter-Arrival distribution canada
[Two histograms: occurrences vs. time (in sec)]
24
Inter-Arrival distribution
  • The distribution looks similar for all three channels
  • The distribution is heavy-tailed for two main
    reasons:
      • Network splits add zero values (during
        reconnection) and large values (during the split)
      • Periods of low activity add a tail (more relevant for
        channels with a non-uniform geographical
        distribution, like bosnia)
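
Inter-arrival times are the differences between consecutive join timestamps. A minimal sketch, using the JOIN timestamps from the sample log shown earlier; the `inter_arrival_times` helper is a hypothetical name.

```python
def inter_arrival_times(join_timestamps):
    """Differences (in seconds) between consecutive join times."""
    ordered = sorted(join_timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# JOIN timestamps taken from the sample log.
joins = [1053587032, 1053587032, 1053587045, 1053587059,
         1053587080, 1053587082, 1053587095]
gaps = inter_arrival_times(joins)
# gaps == [0, 13, 14, 21, 2, 13]
```

Two joins logged in the same second produce a zero, which is the kind of value that a mass rejoin after a split contributes in bulk.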

25
Inter-arrival time fits
  • LogNormal distribution gives the best fit in almost all
    cases
  • The only exception is israel, where InvGauss fits
    best under the A-D and K-S tests
  • Exponential distribution is very far from
    optimal

[Fit results shown per channel: israel, canada, bosnia]
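
A comparison of candidate distributions with the K-S test can be sketched with the standard library alone. The slides do not say which tool produced the fits, so this is not the author's procedure: it fits only the exponential and lognormal models by maximum likelihood, and it runs on synthetic stand-in data rather than the collected trace.

```python
import math
import random


def ks_statistic(sorted_data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a model CDF."""
    n = len(sorted_data)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(sorted_data)
    )


def fit_and_score(data):
    """Fit exponential and lognormal models by maximum likelihood; return K-S distances."""
    xs = sorted(x for x in data if x > 0)  # the lognormal model needs positive values
    mean = sum(xs) / len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))

    def exp_cdf(x):
        return 1.0 - math.exp(-x / mean)

    def lognorm_cdf(x):
        return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

    return {"exponential": ks_statistic(xs, exp_cdf),
            "lognormal": ks_statistic(xs, lognorm_cdf)}


# Synthetic stand-in data (NOT the collected trace); parameters are arbitrary.
rng = random.Random(0)
synthetic = [rng.lognormvariate(3.0, 1.5) for _ in range(2000)]
scores = fit_and_score(synthetic)  # a smaller distance means a better fit
```

Zero inter-arrival values, such as those caused by splits, are dropped here because the lognormal model is undefined at zero; how the original analysis treated them is not stated.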
26
The audio trace inter-arrival fits
  • Inter-arrival time distribution is similar to IRC
    Channels
  • LogNormal/ InvGauss

27
Session Duration
28
Session duration distribution- israel
[Two histograms: occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
29
Session duration distribution- canada
[Two histograms: occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
30
Session duration distribution- bosnia
[Two histograms: occurrences vs. duration (10^5 sec) and occurrences vs. duration (in sec)]
31
Session duration distribution
  • Very heavy tail for two reasons:
      • Many users spent a lot of time in the channel
      • Robots
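
Session durations can be derived by pairing each join with the matching leave event while following nickname changes. The sketch below assumes events shaped as (timestamp, event, arguments), with the leaving nickname as the first argument of PART, QUIT, and KICK. The event list is invented for illustration, and `session_durations` is a hypothetical helper.

```python
def session_durations(events):
    """Pair each JOIN with the matching PART/QUIT/KICK, following NICK changes."""
    joined_at = {}   # current nickname -> join timestamp
    durations = []
    for timestamp, event, args in events:
        if event == "JOIN":
            joined_at[args[0]] = timestamp
        elif event == "NICK":
            old, new = args[0], args[1]
            if old in joined_at:
                joined_at[new] = joined_at.pop(old)
        elif event in ("PART", "QUIT", "KICK"):
            start = joined_at.pop(args[0], None)
            if start is not None:  # users present before monitoring began are skipped
                durations.append(timestamp - start)
    return durations


# Invented events for illustration only.
events = [
    (100, "JOIN", ["alice", "US"]),
    (130, "NICK", ["alice", "alice_away"]),
    (160, "JOIN", ["bob", "CA"]),
    (400, "PART", ["alice_away"]),
    (700, "QUIT", ["bob"]),
    (800, "PART", ["carol"]),  # no JOIN seen: start time unknown
]
durations = session_durations(events)
# durations == [300, 540]
```

Users who were already in the channel when monitoring started have no known join time, so their sessions are left out here rather than guessed.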

32
Session duration fits
  • BetaGeneral distribution gives the best fit under the
    Chi-Square and K-S tests whenever we limit
    the data samples
  • LogNormal is always in second place (and the best
    fit under the A-D test)
  • When we do not limit the data samples, LogNormal is
    the best.
  • Exponential is very far from optimal

[Fit results shown per channel: israel, canada, bosnia]
33
The audio trace session duration fits
  • Session durations are not similar: extremely heavy
    tail.
  • 90th percentile similar to IRC channels

[Histogram: occurrences vs. time (in sec)]
34
The audio trace session durations
Long sessions (1 min)
  • Long sessions are similar to IRC channels
  • The phenomenon of short sessions is unique to the
    audio trace. No analog in the IRC Channels

Short sessions (
35
Main affecting factors
  • Network failures (splits)
  • Robots and long staying users
  • Geographical distribution of users

36
IRC network splits
  • Any IRC server failure or link failure causes a
    split.
  • To a channel member, a split looks like a mass
    departure of users, and a reconnection looks like a mass
    join of users.
  • Splits contribute a large number of zeros to inter-arrival
    times (about 2 percent of joins come in groups)
  • Splits decrease session durations
  • Most splits last up to 20 minutes

37
Short (temporal) Splits
  • Heuristic: find a group of quits followed by a
    group of joins by the same users.
  • Finds only some of the failures
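
The heuristic can be sketched as follows: collect a burst of quits, then check whether the same nicknames join again soon afterwards. All thresholds (`window`, `min_size`, `max_split`) are illustrative assumptions rather than values from the slides, and the event list is invented.

```python
def find_splits(events, window=2, min_size=3, max_split=1200):
    """Return (split_start, split_end) pairs found by the quit-burst / rejoin heuristic.

    events: time-ordered (timestamp, event, nick) tuples. Thresholds are assumptions.
    """
    splits = []
    i = 0
    while i < len(events):
        timestamp, event, nick = events[i]
        if event != "QUIT":
            i += 1
            continue
        # Collect consecutive quits that follow each other within `window` seconds.
        burst = {nick}
        last = timestamp
        j = i + 1
        while j < len(events) and events[j][1] == "QUIT" and events[j][0] - last <= window:
            burst.add(events[j][2])
            last = events[j][0]
            j += 1
        if len(burst) >= min_size:
            # Look for the same users joining again within `max_split` seconds.
            rejoins = [t for t, e, n in events[j:]
                       if e == "JOIN" and n in burst and t - last <= max_split]
            if len(rejoins) >= min_size:
                splits.append((timestamp, min(rejoins)))
        i = j
    return splits


# Invented events for illustration only.
events = [
    (1000, "JOIN", "dave"),
    (2000, "QUIT", "ann"), (2000, "QUIT", "ben"), (2001, "QUIT", "cat"),
    (2300, "JOIN", "eve"),
    (2600, "JOIN", "ann"), (2600, "JOIN", "ben"), (2601, "JOIN", "cat"),
    (9000, "QUIT", "dave"),
]
splits = find_splits(events)
# splits == [(2000, 2600)]
```

A sketch like this misses splits whose users never return within the time limit, or return under different nicknames, which matches the remark that the heuristic finds only some of the failures.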

38
Split durations
[Histogram: occurrences vs. split duration (sec)]
39
Robots
We define a robot as any client that is logged in
for more than 8 hours per day on average.
  • Add a constant to the number of logged-in users
  • Add a heavy tail to session durations
  • Do not affect inter-arrival and join statistics
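
The robot definition translates directly into a threshold on average logged-in hours per day. The sketch assumes sessions are available as (nick, start, end) tuples in seconds and that the length of the observation period is known. `find_robots` and the session data are hypothetical.

```python
def find_robots(sessions, observation_days, threshold_hours=8.0):
    """Return nicknames logged in for more than `threshold_hours` per day on average."""
    total_seconds = {}
    for nick, start, end in sessions:
        total_seconds[nick] = total_seconds.get(nick, 0) + (end - start)
    return sorted(
        nick for nick, seconds in total_seconds.items()
        if seconds / 3600.0 / observation_days > threshold_hours
    )


# Invented sessions over a 2-day observation period.
sessions = [
    ("bot1", 0, 86400), ("bot1", 90000, 150000),  # about 20.3 hours per day
    ("user1", 1000, 8200),                         # 1 hour per day
    ("user2", 0, 57600),                           # exactly 8 hours per day: not above it
]
robots = find_robots(sessions, observation_days=2)
# robots == ["bot1"]
```

Grouping by nickname is a simplification: a client that changes nicknames would need its sessions merged first.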

40
Distribution of logged robots number
[Histogram: occurrences vs. number of bots]
41
Robots session durations (channel canada)
42
Geographical distribution
43
Geographical distribution during day hours
44
Number of logged in users (channel size)
45
Number of user joins per hour
46
User traffic (Israel)
[Two plots: joins per hour vs. hour of day; channel size vs. hour of day]
47
User traffic (bosnia)
[Two plots: joins per hour vs. hour of day; channel size vs. hour of day]
48
User traffic (canada)
[Two plots: joins per hour vs. hour of day; channel size vs. hour of day]
49
User traffic as a function of time of day:
observations
  • The function is very stable across different days
  • The shape of the graph is mainly determined by the geographical
    distribution of users
  • It has great influence on the distribution of other parameters,
    such as the number of online users and the number
    of joins per hour.
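
A joins-per-hour profile over the day can be built by bucketing join timestamps by hour. The sketch uses UTC, which is an assumption since the slides do not state the time zone of the plots; `joins_by_hour_of_day` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timezone


def joins_by_hour_of_day(join_timestamps):
    """Count joins in each hour of the day (0..23), using UTC as an assumed time zone."""
    counts = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in join_timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


# JOIN timestamps taken from the sample log; all fall within the same hour.
joins = [1053587032, 1053587032, 1053587045, 1053587059,
         1053587080, 1053587082, 1053587095]
profile = joins_by_hour_of_day(joins)
```

Dividing each bucket by the number of observed days would give the average joins per hour that the traffic plots appear to show.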

50
Joins per hour distribution - israel
[Histogram: occurrences vs. joins in hour]
51
Joins per hour distribution - bosnia
[Histogram: occurrences vs. joins in hour]
52
Joins per hour distribution - canada
[Histogram: occurrences vs. joins in hour]
53
Data traffic (Israel)
[Two plots: msg per hour vs. hour of day; bytes per hour vs. hour of day]
54
Data traffic (bosnia)
[Two plots: msg per hour vs. hour of day; bytes per hour vs. hour of day]
55
Data traffic (canada)
[Two plots: msg per hour; bytes per hour]
56
Data traffic observations
  • The two graphs are highly correlated due to the
    nature of the messages.
  • Some exceptions come from robots violating the
    game rules.
  • There is some correlation with the number of logged-in users,
    but the curve is much flatter.

57
Users activity (writers)
58
Users activity (part 2)
59
Short multicast event
Timeline (time in minutes):
  • 10: joining starts
  • 40: most participants have joined
  • 50: last participant joins; the event starts
  • 110: the event ends
  • 120: participants leave
  • 190: users leave
60
Short multicast event (data traffic)
  • Time resolution: 5 min.

[Two plots: msgs and bytes vs. time (minutes)]
61
Conclusions
  • Modeling multicast group behavior through IRC
    users is possible.
  • It is difficult to fit empirical data to pure
    analytical models due to the combination of
    different factors (user types, system failures,
    etc.). The simulation process must take all these
    factors into account.
  • The famous audio log is inadequate with respect
    to some important parameters
  • The traditional assumption of a uniform
    spatial distribution is not always correct
  • Data logs and scripts are available for use