Uncovering Functional Networks in Internet Traffic

Learn more at: http://vw.indiana.edu

Transcript and Presenter's Notes

1
Uncovering Functional Networks in Internet Traffic

Mark Meiss
September 25, 2006

2
Who am I?

Mark Meiss
Ph.D. candidate in Computer Science
Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly
Researcher at the Advanced Network Management Laboratory (ANML)
http://anml.iu.edu/

4
What's the agenda?

The subject of today's story:
Finding a way to improve security without compromising user privacy
A case study in applied network science
This work is done with Filippo Menczer and Alessandro Vespignani.

5
What do people do online?
There's what we imagine...
6
What do people do online?
And there's what is actually happening.
7
Not just a value judgment

These applications all affect the health of a data network.
There are legal problems, yes, but also:
Crowding out other applications
(Napster was once over 70% of all IUB traffic)
Compromised computers are used to launch further attacks.
Common nuisances are on the Net as well.

8
The bottom line

Network administrators
need to be able to identify
what applications
are being used on the network.

but this can be very difficult.
9
A crash course in data networks

We'll use a running example:
Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.

20
Quick summary

Each network conversation is identified by four pieces of information:
Client address and port number
Server address and port number
The server uses a well-known port number
The client uses an ephemeral port number
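
As a rough illustration of the four-part identifier, the sketch below models a conversation as a tuple. The class name `Conversation`, its field names, and the second conversation are hypothetical; the first address and port pair is taken from Buddy's example later in the talk.

```python
from typing import NamedTuple


class Conversation(NamedTuple):
    """Hypothetical record holding the four identifying values."""
    client_addr: str
    client_port: int  # ephemeral: picked by the client for this conversation
    server_addr: str
    server_port: int  # well-known: 80 is the IANA-assigned port for the Web


# Buddy's browser talking to the web server.
web = Conversation("192.168.65.33", 13029, "10.99.205.122", 80)

# Assumed second request from the same machine to the same server.
# Only the ephemeral port differs, yet it is a separate conversation.
second = Conversation("192.168.65.33", 13030, "10.99.205.122", 80)

print(web != second)
```

Because all four values take part in the comparison, two requests between the same pair of hosts stay distinguishable.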

21
So why is it hard to identify applications?

Well-known ports are a convention, not a rule
Web, e-mail, etc. do have ports assigned by the IANA
BitTorrent, Gnutella, Napster, etc. do not
Client and server ports share the same namespace
In practice: any application can use any pair of port numbers
Our focus: discovering what application is running on a port with no assigned use.

22
The conventional solution

Let's look inside
all of those packets!

25
Another problem

Packet inspection doesn't scale
Modern high-speed networks run at 10 gigabits per second or faster
(that's one full DVD every few seconds)
General-purpose computers can't even copy that data in real time
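
The DVD comparison can be checked with a little arithmetic. This is a back-of-the-envelope sketch; the 4.7 GB figure for a single-layer DVD is an assumption, since the slide does not say which capacity it has in mind.

```python
LINK_BITS_PER_SECOND = 10e9  # 10 gigabits per second, from the slide
DVD_BYTES = 4.7e9            # assumption: single-layer DVD capacity

# Eight bits per byte, so a 10 Gbps link moves 1.25 GB every second.
bytes_per_second = LINK_BITS_PER_SECOND / 8
seconds_per_dvd = DVD_BYTES / bytes_per_second

print(f"{bytes_per_second / 1e9:.2f} GB/s, one DVD every {seconds_per_dvd:.2f} s")
```

Under that assumption the link fills a DVD in under four seconds, which matches "every few seconds".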

28
Introducing the flow

We can summarize Buddy's Web surfing as two flows:
192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
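
One way to picture how many packets collapse into two flows is sketched below. The packet trace is invented: the individual packet sizes are assumptions chosen only so that the totals match the byte counts on the slide, and the endpoints are read as 192.168.65.33 port 13029 and 10.99.205.122 port 80.

```python
from collections import defaultdict

CLIENT = ("192.168.65.33", 13029)
SERVER = ("10.99.205.122", 80)

# Hypothetical packet trace: (source, destination, bytes).
packets = (
    [(CLIENT, SERVER, 456)]            # the request
    + [(SERVER, CLIENT, 1460)] * 43    # full-sized reply packets
    + [(SERVER, CLIENT, 431)]          # final, shorter reply packet
)

# A flow is one direction of a conversation: total bytes per (source, destination).
flows = defaultdict(int)
for src, dst, size in packets:
    flows[(src, dst)] += size

for (src, dst), total in flows.items():
    print(f"{src[0]}:{src[1]} to {dst[0]}:{dst[1]} ({total:,} bytes)")
```

Forty-five packets reduce to two records, which is what makes flow data so much cheaper to keep than the packets themselves.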

29
Where do flows come from?

Architectural features of Internet routers allow
them to export flow data
Routers can't summarize all the data
Packets are sampled to construct the flows
Typical sampling rate is around 1:100
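
To see why sampled flows still carry useful totals, here is a small simulation. It assumes uniform random sampling of one packet in 100 and a made-up traffic mix; real routers may sample deterministically or in other ways, so treat this only as a sketch of the scaling idea.

```python
import random

SAMPLING_RATE = 100  # keep roughly 1 packet in 100, per the slide
random.seed(0)       # fixed seed so the sketch is repeatable

# Hypothetical traffic: 100,000 packets of 1,000 bytes each.
packet_sizes = [1000] * 100_000
true_bytes = sum(packet_sizes)

# Assumed sampling scheme: each packet is kept with probability 1/100.
sampled = [size for size in packet_sizes if random.randrange(SAMPLING_RATE) == 0]

# Scale the sampled total back up to estimate the real volume.
estimated_bytes = sum(sampled) * SAMPLING_RATE
relative_error = abs(estimated_bytes - true_bytes) / true_bytes

print(len(sampled), estimated_bytes, f"{relative_error:.1%}")
```

The estimate is good for heavy flows and poor for tiny ones, since a short conversation may not contribute a single sampled packet.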

30
What can you do with a flow?

Usual answer:
Treat a flow as a record in a relational database
Who talked to port 1337?
What proportion of our traffic is on port 80?
Who is scanning for vulnerable systems?
Which hosts are infected with this worm?
These are useful and valid questions.
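
The relational view can be sketched with an in-memory SQLite table. The table name, column names, and the three flow records are all hypothetical; only the two questions being asked come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows (src_addr TEXT, src_port INTEGER, "
    "dst_addr TEXT, dst_port INTEGER, bytes INTEGER)"
)

# Hypothetical flow records.
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    [
        ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
        ("192.168.65.40", 40211, "10.99.205.122", 80, 344),
        ("192.168.65.33", 13030, "10.7.7.7", 1337, 200),
    ],
)

# Who talked to port 1337?
talkers = [row[0] for row in conn.execute(
    "SELECT DISTINCT src_addr FROM flows WHERE dst_port = 1337")]

# What proportion of our traffic (by bytes) is on port 80?
(share,) = conn.execute(
    "SELECT SUM(CASE WHEN dst_port = 80 THEN bytes ELSE 0 END) * 1.0 / SUM(bytes) "
    "FROM flows").fetchone()

print(talkers, round(share, 2))
```

Each question is a filter or an aggregate over independent rows; nothing in this view relates one host's behavior on one port to its behavior on another.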

31
What can you do with a flow?

Our approach:
Treat a flow as a directed, weighted edge
The resulting network describes user behavior
Hold that thought for now
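
A minimal sketch of the edge view follows. It assumes the edge weight is the number of bytes sent, which the slide does not state explicitly, and the host names and the third flow are invented for illustration.

```python
from collections import defaultdict

# Hypothetical flows: (source host, destination host, bytes).
flows = [
    ("buddy", "vulgar-web", 456),
    ("vulgar-web", "buddy", 63211),
    ("buddy", "vulgar-web", 544),
]

# graph[src][dst] is the weight of the directed edge src -> dst.
# Repeated flows between the same pair add to the same edge.
graph = defaultdict(lambda: defaultdict(int))
for src, dst, size in flows:
    graph[src][dst] += size


def out_strength(host):
    """Total weight on the edges leaving host (here: bytes sent)."""
    return sum(graph[host].values())


print(out_strength("buddy"), out_strength("vulgar-web"))
```

The out-strength computed here is the per-host quantity that the application-correlation slides build on.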

32
The Internet2/Abilene network

TCP/IP network connecting research and
educational institutions in the U.S.
Over 200 universities and corporate research labs
Also provides transit service between Pacific Rim
and European networks

33
Why study Abilene?

Wide-area network that includes both domestic and
international traffic
Heterogeneous user base including hundreds of
thousands of undergraduates
High capacity network (10-Gbps fiber-optic links)
that has never been congested
Research partnership gives access to (anonymized)
traffic data unavailable from commercial networks

34
Flow collection
Flows are exported in Cisco's NetFlow v5 format and anonymized before being written to disk.
35
Data dimensions

Observed Abilene on April 14, 2005
About 200 terabytes of data exchanged
This is roughly 25,000 DVDs of information
600 million flow records
Almost 28 gigabytes on disk
15 million unique hosts involved

36
Forming a bipartite network

Motivation
Clients and servers perform different functions
A web browser is very different from a web server
Most hosts are one or the other
Identifying clients and servers
Recall that there is a single namespace for ports
Heuristic: the more common port is the server
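
The heuristic might be implemented along these lines. This is one reading of "more common", namely counting how often each port number appears at either end of any flow; the flow list, the function name `split_client_server`, and the tie-breaking rule are assumptions.

```python
from collections import Counter

# Hypothetical flows as pairs of (address, port) endpoints, in either direction.
flows = [
    (("192.168.65.33", 13029), ("10.99.205.122", 80)),
    (("10.99.205.122", 80), ("192.168.65.33", 13029)),
    (("192.168.65.40", 40211), ("10.99.205.122", 80)),
    (("192.168.65.41", 51515), ("10.99.205.122", 80)),
]

# Count how often each port number appears at either end of a flow.
port_counts = Counter(port for a, b in flows for _, port in (a, b))


def split_client_server(a, b):
    """Return (client, server); the endpoint with the more common port is the server.

    Ties are resolved in favor of the second endpoint (an arbitrary choice).
    """
    return (a, b) if port_counts[b[1]] >= port_counts[a[1]] else (b, a)


for a, b in flows:
    client, server = split_client_server(a, b)
    print(f"client {client[0]}:{client[1]} -> server {server[0]}:{server[1]}")
```

Port 80 appears in every flow while each ephemeral port appears only once or twice, so the web server is labeled correctly regardless of flow direction.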

37
Weighted bipartite digraph
39
Multiple digraphs
Port 80 (Web)
Port 6346 (Gnutella)
Port 19101 (???)
Port 25 (Mail)
40
Application correlation

Consider the out-strength of a client in the
networks for ports p and q

41
Application correlation

Build a pair of vectors from the distribution of
strength values

42
Application correlation

Examine the cosine similarity of the vectors
When s = 0, applications p and q are never used together.
When s = 1, applications p and q are always used together, and to the same extent.
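
The two limiting cases can be reproduced with a short sketch. It assumes each vector has one entry per client, holding that client's out-strength in the network for the given port; the slides' own vector definition did not survive in the transcript, and the strength values below are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an unused port
    return dot / (norm_u * norm_v)


# Hypothetical out-strengths (bytes) of four clients on three ports.
strength = {
    "p": [100, 0, 50, 0],
    "q": [0, 30, 0, 70],    # used only by the clients who never use p
    "r": [200, 0, 100, 0],  # same clients as p, in the same proportions
}

s_pq = cosine_similarity(strength["p"], strength["q"])
s_pr = cosine_similarity(strength["p"], strength["r"])
print(s_pq, round(s_pr, 6))
```

Ports p and q share no clients, so their similarity is 0; p and r are used by the same clients in proportion, so their similarity is 1 even though r carries twice the traffic.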

43
Clustering applications

We now have s(p, q) for every pair of ports
Convert these similarities into distances
If s = 0, then d is large; if s = 1, then d = 0
Now apply Ward's hierarchical clustering algorithm
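
The following is a small pure-Python sketch of these two steps, not the code used for the talk. The mapping d = 1 - s is an assumption (the slide gives only the endpoints), the similarity values are invented, and Ward's method is implemented through the Lance-Williams update on squared distances, which is strictly justified only for Euclidean distances.

```python
from itertools import combinations


def similarity_to_distance(s):
    # Assumed mapping: s = 1 gives d = 0, and s = 0 gives the largest d (1).
    return 1.0 - s


def ward_merges(similarity):
    """similarity maps frozenset({p, q}) to s(p, q); returns merges in order."""
    ports = sorted({p for pair in similarity for p in pair})
    clusters = [frozenset([p]) for p in ports]
    # Squared distances between clusters, keyed by the unordered pair.
    d2 = {
        frozenset([a, b]): similarity_to_distance(similarity[a | b]) ** 2
        for a, b in combinations(clusters, 2)
    }
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda ab: d2[frozenset(ab)])
        merged = a | b
        for k in clusters:
            if k in (a, b):
                continue
            # Lance-Williams update for Ward's method.
            d2[frozenset([k, merged])] = (
                (len(a) + len(k)) * d2[frozenset([k, a])]
                + (len(b) + len(k)) * d2[frozenset([k, b])]
                - len(k) * d2[frozenset([a, b])]
            ) / (len(a) + len(b) + len(k))
        clusters = [k for k in clusters if k not in (a, b)] + [merged]
        merges.append((set(a), set(b)))
    return merges


# Hypothetical similarity scores for four ports.
similarity = {
    frozenset([21, 25]): 0.9,      # FTP and mail
    frozenset([6346, 6881]): 0.8,  # two file-sharing ports
    frozenset([25, 6346]): 0.1,
    frozenset([25, 6881]): 0.1,
    frozenset([21, 6346]): 0.2,
    frozenset([21, 6881]): 0.1,
}
merges = ward_merges(similarity)
print(merges)
```

With these made-up scores the two traditional ports merge first, the two file-sharing ports second, and the two groups are joined only at the last step.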

45
Natural clusters

The Web
Correlated with almost every application
Use is nearly universal
Traditional applications
Includes mail, FTP, news, remote access, etc.
Characterized by dedicated servers
Peer-to-peer applications
Includes file sharing: Gnutella, BitTorrent, etc.
Users often use several of these

46
Classifying unknown applications

To classify an unknown application, see what
known applications it clusters with
Our classification experiment
Take 16 unknown ports
Guess function based on similarity data
Validate or invalidate guesses based on external
evidence
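
A simplified stand-in for the guessing step is sketched below. It ranks known applications by similarity to the unknown port, which is a nearest-neighbor shortcut rather than a reading of the cluster hierarchy, and every similarity value is invented for illustration.

```python
# Hypothetical similarity scores between an unknown port and known applications.
similarity_to_unknown = {
    "FTP": 0.62,
    "Hotline": 0.55,
    "Web": 0.30,
    "Gnutella": 0.05,
}


def closest_known(similarities, k=2):
    """Return the k known applications most similar to the unknown port."""
    return sorted(similarities, key=similarities.get, reverse=True)[:k]


neighbors = closest_known(similarity_to_unknown)
print(neighbors)
```

The guess about the unknown port's function then comes from what its closest neighbors have in common, and it still has to be checked against external evidence.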

47
Example 1

Port 388 is coupled with FTP and Hotline
FTP is a file transfer application
Hotline is an early file-sharing application
Our guess: traditional file transfer application
Actual identity: Unidata/LDM
Used for moving large meteorological data sets

48
Example 2

Port 19101 is coupled with instant messaging and
P2P applications
Our guess: a P2P application that relies on individual contact for file transfers
Actual identity: Clubbox
Korean file-sharing program
Users trade large files on virtual hard drives

50
Overall results

For our 16 guesses
8 were unambiguously correct
6 were partially correct
These turned out to be trojans and malware
We learned that IRC + P2P = evil afoot
2 could not be confirmed or disproven
Ports were in transient use during data collection

51
Implications

We can identify the type of an application
without examining a single packet!
Scalable
Preserves user privacy
Difficult to do with relational view of flow data

57
Broader application

Generic view of the situation
Weighted network of entities derived from
activity with labeled classes of interaction
Find the sub-network for each labeled class
Use the network distributions to calculate
similarity scores for the classes
Use the similarity scores to cluster the classes
Classify unknown classes using these clusters

58
Thank you!