Title: Uncovering Functional Networks in Internet Traffic
1 Uncovering Functional Networks in Internet Traffic
- Mark Meiss
- September 25, 2006
2 Who am I?
- Mark Meiss
- Ph.D. candidate in Computer Science
- Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly
- Researcher at the Advanced Network Management Laboratory (ANML)
- http://anml.iu.edu/
4 What's the agenda?
- The subject of today's story:
- Finding a way to improve security without compromising user privacy
- A case study in applied network science
- This work is done with Filippo Menczer and Alessandro Vespignani.
5 What do people do online?
There's what we imagine
6 What do people do online?
And there's what is actually happening
7 Not just a value judgment
- These applications all affect the health of a data network.
- There are legal problems, yes, but also:
- Crowding out other applications.
- (Napster was once over 70% of all IUB traffic)
- Compromised computers are used to launch further attacks.
- Common nuisances are on the Net as well.
8 The bottom line
- Network administrators
- need to be able to identify
- what applications
- are being used on the network.
but this can be very difficult.
9 A crash course in data networks
- We'll use a running example:
- Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.
20 Quick summary
- Each network conversation is identified by four pieces of information:
- Client address and port number
- Server address and port number
- The server uses a well-known port number
- The client uses an ephemeral port number
21 So why is it hard to identify applications?
- Well-known ports are a convention, not a rule
- Web, e-mail, etc. do have ports assigned by the IANA
- BitTorrent, Gnutella, Napster, etc. do not
- Client and server ports share the same namespace
- In practice:
- Any application can use any pair of port numbers
- Our focus: discovering what application is running on a port with no assigned use.
22 The conventional solution
- Let's look inside
- all of those packets!
25 Another problem
- Packet inspection doesn't scale
- Modern high-speed networks run at 10 gigabits per second or faster
- (that's one full DVD every few seconds)
- General-purpose computers can't even copy that data in real time
28 Introducing the flow
- We can summarize Buddy's Web surfing as two flows:
- 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
- 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
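As a minimal sketch, the two flows above can be held as small records, reading each endpoint as address:port. The `Flow` type, its field names, and the helper functions are illustrative assumptions, not the format used in the study.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One direction of a conversation, summarized (field names are illustrative)."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    n_bytes: int


def parse_endpoint(text: str) -> tuple[str, int]:
    """Split an 'address:port' string into its two parts."""
    addr, port = text.rsplit(":", 1)
    return addr, int(port)


def make_flow(src: str, dst: str, n_bytes: int) -> Flow:
    """Build a Flow from two 'address:port' strings and a byte count."""
    return Flow(*parse_endpoint(src), *parse_endpoint(dst), n_bytes)


# Buddy's request and the server's reply, using the figures from the slide.
request = make_flow("192.168.65.33:13029", "10.99.205.122:80", 456)
reply = make_flow("10.99.205.122:80", "192.168.65.33:13029", 63211)
```

Note that a flow is one-directional: a single web page fetch shows up as two records with the endpoints swapped.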
29 Where do flows come from?
- Architectural features of Internet routers allow them to export flow data
- Routers can't summarize all the data
- Packets are sampled to construct the flows
- Typical sampling rate is around 1:100
30 What can you do with a flow?
- Usual answer:
- Treat a flow as a record in a relational database
- Who talked to port 1337?
- What proportion of our traffic is on port 80?
- Who is scanning for vulnerable systems?
- Which hosts are infected with this worm?
- These are useful and valid questions.
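The relational view can be sketched with an in-memory SQLite table. The table name, column names, and sample rows below are made up for illustration; only the two questions come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flows (src_addr TEXT, src_port INTEGER,"
    " dst_addr TEXT, dst_port INTEGER, n_bytes INTEGER)"
)
# Made-up sample records, for illustration only.
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    [
        ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
        ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
        ("192.168.65.40", 40112, "10.1.2.3", 1337, 1200),
    ],
)

# Who talked to port 1337?
talkers = [row[0] for row in conn.execute(
    "SELECT DISTINCT src_addr FROM flows WHERE dst_port = 1337")]

# What proportion of our traffic (by bytes) is on port 80?
(web_share,) = conn.execute(
    "SELECT 1.0 * SUM(CASE WHEN src_port = 80 OR dst_port = 80"
    " THEN n_bytes ELSE 0 END) / SUM(n_bytes) FROM flows"
).fetchone()
```

Each question here is answered by filtering and aggregating individual records; none of them looks at how hosts relate to one another.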
31 What can you do with a flow?
- Our approach:
- Treat a flow as a directed, weighted edge
- The resulting network describes user behavior
- Hold that thought for now
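A minimal sketch of the edge view, assuming the edge weight is the number of bytes sent (the slide does not say which quantity is used as the weight). The function names are hypothetical.

```python
from collections import defaultdict


def build_network(flows):
    """Sum bytes per (source, destination) pair to get directed edge weights."""
    weights = defaultdict(int)
    for src, dst, n_bytes in flows:
        weights[(src, dst)] += n_bytes
    return dict(weights)


def out_strength(weights, host):
    """Total weight on the edges leaving `host`."""
    return sum(w for (src, _dst), w in weights.items() if src == host)


flows = [
    ("192.168.65.33", "10.99.205.122", 456),
    ("10.99.205.122", "192.168.65.33", 63211),
    ("192.168.65.33", "10.99.205.122", 300),  # made-up second request
]
network = build_network(flows)
```

Repeated flows between the same pair of hosts collapse into a single edge whose weight is their total.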
32 The Internet2/Abilene network
- TCP/IP network connecting research and educational institutions in the U.S.
- Over 200 universities and corporate research labs
- Also provides transit service between Pacific Rim and European networks
33 Why study Abilene?
- Wide-area network that includes both domestic and international traffic
- Heterogeneous user base including hundreds of thousands of undergraduates
- High capacity network (10-Gbps fiber-optic links) that has never been congested
- Research partnership gives access to (anonymized) traffic data unavailable from commercial networks
34 Flow collection
Flows are exported in Cisco's netflow-v5 format and anonymized before being written to disk.
35 Data dimensions
- Observed Abilene on April 14, 2005
- About 200 terabytes of data exchanged
- This is roughly 25,000 DVDs of information
- 600 million flow records
- Almost 28 gigabytes on disk
- 15 million unique hosts involved
36 Forming a bipartite network
- Motivation:
- Clients and servers perform different functions
- A web browser is very different from a web server
- Most hosts are one or the other
- Identifying clients and servers:
- Recall that there is a single namespace for ports
- Heuristic: the more common port is the server
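One way to sketch the heuristic, assuming "more common" means "appears in more flow records across the data set". The counting rule, the tie handling, and the function name are assumptions made for illustration.

```python
from collections import Counter


def assign_roles(flows):
    """Label each flow's endpoints as client or server.

    flows: iterable of (src_addr, src_port, dst_addr, dst_port).
    The endpoint whose port number appears more often across the whole
    data set is taken to be the server. Ties are left unresolved (None).
    """
    flows = list(flows)
    port_counts = Counter()
    for _src, sport, _dst, dport in flows:
        port_counts[sport] += 1
        port_counts[dport] += 1

    roles = []
    for src, sport, dst, dport in flows:
        if port_counts[sport] > port_counts[dport]:
            roles.append({"server": (src, sport), "client": (dst, dport)})
        elif port_counts[dport] > port_counts[sport]:
            roles.append({"server": (dst, dport), "client": (src, sport)})
        else:
            roles.append(None)
    return roles


# Made-up flows: two clients talking to one web server.
flows = [
    ("192.168.65.33", 13029, "10.99.205.122", 80),
    ("10.99.205.122", 80, "192.168.65.33", 13029),
    ("192.168.65.40", 51515, "10.99.205.122", 80),
]
roles = assign_roles(flows)
```

Ephemeral ports are spread thinly over many values, so a port that many conversations share is more likely to be the server side.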
37 Weighted bipartite digraph
39 Multiple digraphs
Port 80 (Web)
Port 6346 (Gnutella)
Port 19101 (???)
Port 25 (Mail)
40 Application correlation
- Consider the out-strength of a client in the networks for ports p and q
41 Application correlation
- Build a pair of vectors from the distribution of strength values
42 Application correlation
- Examine the cosine similarity of the vectors
- When s = 0, applications p and q are never used together.
- When s = 1, applications p and q are always used together, and to the same extent.
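A minimal sketch of the similarity score, assuming each port's vector has one entry per client holding that client's out-strength on the port. The exact vector construction is not spelled out in the slides, and the client names and numbers are made up.

```python
import math


def cosine_similarity(strength_p, strength_q):
    """Cosine similarity of two per-client strength maps (client -> strength)."""
    clients = set(strength_p) | set(strength_q)
    dot = sum(strength_p.get(c, 0) * strength_q.get(c, 0) for c in clients)
    norm_p = math.sqrt(sum(v * v for v in strength_p.values()))
    norm_q = math.sqrt(sum(v * v for v in strength_q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # a port with no traffic is treated as unrelated
    return dot / (norm_p * norm_q)


web = {"alice": 100, "bob": 50}
mail = {"alice": 200, "bob": 100}   # same clients, same proportions
p2p = {"carol": 400}                # disjoint set of clients

s_web_mail = cosine_similarity(web, mail)
s_web_p2p = cosine_similarity(web, p2p)
```

Here `s_web_mail` comes out at 1 (up to rounding) because the two vectors are proportional, and `s_web_p2p` is 0 because no client uses both applications.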
43 Clustering applications
- We now have s(p, q) for every pair of ports
- Convert these similarities into distances
- If s = 0, then d is large; if s = 1, then d = 0
- Now apply Ward's hierarchical clustering algorithm
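The steps above can be sketched in plain Python. The conversion d = 1/s - 1 (capped at a large value) is only one function with the stated behavior and is an assumption, as are the function names and the made-up similarity values. The clustering uses the standard Lance-Williams update for Ward's criterion on squared distances.

```python
def similarity_to_distance(s, d_max=1e6):
    """One possible conversion (an assumption): d = 1/s - 1, capped at d_max."""
    if s <= 0:
        return d_max
    return min(1.0 / s - 1.0, d_max)


def ward_merges(labels, dist):
    """Agglomerative clustering with Ward's criterion.

    labels: list of item names. dist: dict mapping frozenset({a, b}) to the
    distance between items a and b. Returns the merge order as a list of
    (cluster_a, cluster_b) pairs, each cluster being a tuple of names.
    """
    clusters = {i: (name,) for i, name in enumerate(labels)}
    d2 = {}  # squared distances, keyed by (smaller id, larger id)
    for i in clusters:
        for j in clusters:
            if i < j:
                d = dist[frozenset((labels[i], labels[j]))]
                d2[(i, j)] = d * d
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(d2, key=d2.get)  # closest pair of clusters
        ni, nj = len(clusters[i]), len(clusters[j])
        new = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2[(min(i, k), max(i, k))]
            djk = d2[(min(j, k), max(j, k))]
            # Lance-Williams update for Ward's method.
            new[(k, next_id)] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * d2[(i, j)]
            ) / (ni + nj + nk)
        merges.append((clusters[i], clusters[j]))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        d2 = {key: v for key, v in d2.items() if i not in key and j not in key}
        d2.update(new)
        next_id += 1
    return merges


# Made-up similarities: two tight pairs, everything else weakly related.
labels = ["mail", "ftp", "gnutella", "bittorrent"]
sims = {
    frozenset(("mail", "ftp")): 0.9,
    frozenset(("gnutella", "bittorrent")): 0.8,
}
dist = {}
for a in labels:
    for b in labels:
        if a < b:
            pair = frozenset((a, b))
            dist[pair] = similarity_to_distance(sims.get(pair, 0.1))
merges = ward_merges(labels, dist)
```

With these values the two pairs merge first and are joined only at the final step. Ward's method is defined for Euclidean distances, so applying it to converted similarities is a formal use of the update rule.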
45 Natural clusters
- The Web
- Correlated with almost every application
- Use is nearly universal
- Traditional applications
- Includes mail, FTP, news, remote access, etc.
- Characterized by dedicated servers
- Peer-to-peer applications
- Includes file sharing: Gnutella, BitTorrent, etc.
- Users often use several of these
46 Classifying unknown applications
- To classify an unknown application, see what known applications it clusters with
- Our classification experiment:
- Take 16 unknown ports
- Guess function based on similarity data
- Validate or invalidate guesses based on external evidence
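A rough sketch of how a guess could be automated. The talk does not spell out its exact rule, so the voting scheme, the function name, the category labels, and the similarity values below are all assumptions for illustration.

```python
def guess_category(unknown_port, similarity, known_categories, top_n=3):
    """Guess a category from the most similar known ports.

    similarity: dict mapping (port_a, port_b) pairs to s values, in either
    order. known_categories: dict mapping known port -> category label.
    Returns the category with the largest total similarity among the
    top_n most similar known ports.
    """
    def s(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    ranked = sorted(known_categories, key=lambda p: s(unknown_port, p),
                    reverse=True)[:top_n]
    votes = {}
    for port in ranked:
        category = known_categories[port]
        votes[category] = votes.get(category, 0.0) + s(unknown_port, port)
    return max(votes, key=votes.get)


# Made-up similarity values for one unknown port.
known = {21: "file transfer", 25: "mail", 6346: "P2P"}
sims = {(388, 21): 0.7, (388, 25): 0.2, (6346, 388): 0.1}
guess = guess_category(388, sims, known)
```

A guess produced this way still needs external evidence to confirm it, as in the experiment described above.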
47 Example 1
- Port 388 is coupled with FTP and Hotline
- FTP is a file transfer application
- Hotline is an early file-sharing application
- Our guess: traditional file transfer application
- Actual identity: Unidata/LDM
- Used for moving large meteorological data sets
48 Example 2
- Port 19101 is coupled with instant messaging and P2P applications
- Our guess: a P2P application that relies on individual contact for file transfers
- Actual identity: Clubbox
- Korean file-sharing program
- Users trade large files on virtual hard drives
50 Overall results
- For our 16 guesses:
- 8 were unambiguously correct
- 6 were partially correct
- These turned out to be trojans and malware
- We learned that IRC + P2P = evil afoot
- 2 could not be confirmed or disproven
- Ports were in transient use during data collection
51 Implications
- We can identify the type of an application without examining a single packet!
- Scalable
- Preserves user privacy
- Difficult to do with a relational view of flow data
57 Broader application
- Generic view of the situation:
- Weighted network of entities derived from activity with labeled classes of interaction
- Find the sub-network for each labeled class
- Use the network distributions to calculate similarity scores for the classes
- Use the similarity scores to cluster the classes
- Classify unknown classes using these clusters
58 Thank you!