Title: VERNIER Virtualized Execution Realizing Network Infrastructures Enhancing Reliability
1VERNIER: Virtualized Execution Realizing Network Infrastructures Enhancing Reliability
- Project Overview
- July 2006
2Background
- Commercial-off-the-shelf (COTS) software
- Large organizations, including DoD, have become dependent on it
- Yet, most COTS software is not dependable enough for critical applications:
- Security breaches
- Misconfiguration
- Bugs
- Large, homogeneous COTS deployments, such as those in DoD, accentuate the risk, since many users:
- Experience the same failures caused by the same vulnerabilities, configuration errors, and bugs
- Suffer the same costly, adverse consequences
- Alternatives, such as government-funded development of high-assurance systems, present significant barriers in:
- Cost
- Functionality
- Performance
3VERNIER Project Objectives
- Develop new technologies to deliver the benefits of scaling techniques to large application communities
- Provide enhanced survivability to the DoD computing infrastructure
- Enhance the cost, functionality, and performance advantages of COTS computing environments
- Investigate and develop new technologies aimed at enabling communities of systems running similar, widely available COTS software to perform more robustly in the face of attacks and software faults
- Deliver a demonstrated, functioning, transition-ready system that implements these new AC survivability technologies
- Technical approach: augmented virtual machine monitor
- Commercial transition partner: VMware, Inc.
4Project Scope
- Collaborative detection and diagnosis of failures
- Collaborative response to failures
- Advanced situational awareness capabilities
- Collective understanding of community state
- Predictive capability: early warning of potential future problems
- Key goal: turn the size and homogeneity of the user community into an advantage by converting scattered deployments of vulnerable COTS systems into cohesive, survivable application communities that detect, diagnose, and recover from their own failures
- What COTS?
- Microsoft Windows, IE, Office suite, and the like
5Research Challenges
- Extracting behavioral models from binary programs
- Breakthrough novel techniques required
- Quasi-static state analysis for black-box binaries
- Scaled information sharing
- Networked application communities sharing knowledge about the software they run
- Intelligent, comprehensive recovery
- Predictive situational awareness
- Automatic, easy-to-understand gauges
6Breakthrough Capabilities
7Expected Results and Impact
- COTS product (VMware) with breakthrough capabilities for application communities
- Scalability to 100K nodes running augmented VMware and custom VERNIER software
- Automatic collaborative failure diagnosis and recovery
- Survivable, robust system
- Community-aware solution
8VERNIER Team
- SRI International, Menlo Park, CA
- Patrick Lincoln, Principal Investigator
- Steve Dawson, Project manager; integration
- Linda Briesemeister, Knowledge sharing; collaborative response
- Hassen Saidi, Learning-based diagnosis; code analysis; situation awareness
- Stanford University
- John Mitchell, Stanford PI; code analysis; host-based detection and response
- Dan Boneh, Knowledge sharing protocols
- Mendel Rosenblum, VMM infrastructure; collaborative response; transition liaison
- Alex Aiken, Quasi-static binary analysis
- Liz Stinson, Botswat; system security
- Palo Alto Research Center (PARC)
- Jim Thornton, PARC PI; configuration monitoring and response; situation awareness
- Dirk Balfanz, Community response management
- Glenn Durfee, Configuration monitoring and response; situation awareness
- Technology transition partner: VMware, Inc.
9VERNIER Technical Approach
10Notional Host System Architecture
11An Abstraction-Based Diagnosis Capability for
VERNIER
12Objectives
- Based on the general principle that much of security amounts to making sure that an application does what it is supposed to do... and nothing else!
- Build models of application behaviors (what the application is supposed to do).
- Monitor application behavior and report malfunctions and unintended behaviors (deviations from the modeled behavior).
- Use the recorded execution traces as raw data for a set of abstraction-based diagnosis engines (why did the deviation from good intended behavior occur, to the extent that we can answer such a question well).
- Share the state of alerts and diagnoses among the nodes of the community (sharing the bad news... but also the good news!).
- Aggregate the diagnosis outputs and the alerts into a situation awareness gauge.
13Approach
- We combine a set of well-known and well-established techniques:
- Building increasingly accurate models of application behaviors
- Static analysis combined with predicate abstraction to build Dyck and CFG models used for static analysis-based intrusion detection
- Implement mechanisms for monitoring sequences of states and actions of an application for the following purposes:
- Check if a known bad sequence is executed (signature-based!)
- Check for previously unknown variations of known bad sequences (correlation!)
- Find root causes of unexpected malfunctions and malicious exploits (diagnosis)
- Diagnosis is performed using techniques borrowed from:
- Delta debugging (root-cause diagnosis)
- Anomaly detection (correlation)
- The situation awareness gauge is implemented as a platform-independent web interface
14Monitoring-Based Diagnosis
- We combine these techniques into two phases:
- Monitoring: applications are monitored, and sequences of executions along with configurations are stored.
- Diagnosis: differences between good runs and bad runs are the first clues used for diagnosis.
- Traces of executions are sequences of:
- System calls
- Method calls
- Changes in configurations
- The more information is stored, the better the chance that malfunctions and malicious behaviors are properly diagnosed.
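The diagnosis phase can be illustrated with a minimal sketch that diffs a stored good trace against a bad one. The event names and the `trace_diff` helper are hypothetical, and a plain sequence diff stands in for whatever comparison the diagnosis engines actually use.

```python
from difflib import SequenceMatcher

def trace_diff(good, bad):
    """Return (position, event) pairs that appear only in the bad run."""
    matcher = SequenceMatcher(a=good, b=bad, autojunk=False)
    clues = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" spans hold events the good run never produced
        # at this point in the sequence.
        if tag in ("insert", "replace"):
            clues.extend((j, bad[j]) for j in range(j1, j2))
    return clues

# Hypothetical traces mixing system calls and configuration changes.
good_run = ["open:config.ini", "read", "connect:update-server", "close"]
bad_run = ["open:config.ini", "read", "regset:Run", "connect:unknown-host", "close"]

first_clues = trace_diff(good_run, bad_run)
print(first_clues)
```

The differing events are only first clues: they say where the bad run departed from the good one, not why.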
15Quasi-static binary analysis and predicate
abstraction-based intrusion detection
- Use static analysis to recover the control flow graph (CFG) of the application.
- The CFG is generated by compilers for source code.
- Recover the class hierarchy for object code of OO applications.
- Build a pushdown system, a model that represents an over-approximation of the sequences of method and system calls of the application.
- Deal with context sensitivity to match exit calls to return locations.
- Use predicate abstraction and data flow analysis to refine the pushdown system and obtain a more accurate model.
- Improve the knowledge about arguments to monitored calls.
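As a rough illustration of why a pushdown (stack-based) model is more precise than a plain call graph, the sketch below checks a trace of call and return events against a small, made-up call graph. `CALL_GRAPH`, the function names, and the event encoding are assumptions for the example, not the project's model format.

```python
# Hypothetical call graph: caller -> functions it may call (an over-approximation).
CALL_GRAPH = {
    "main": {"load", "save"},
    "load": {"open", "read"},
    "save": {"open", "write"},
}

def accepts(trace, entry="main"):
    """Check a trace of ("call", f) / ("ret", f) events against the model."""
    stack = [entry]
    for kind, func in trace:
        if kind == "call":
            if func not in CALL_GRAPH.get(stack[-1], set()):
                return False      # call edge is not in the static model
            stack.append(func)
        elif kind == "ret":
            if len(stack) <= 1 or stack[-1] != func:
                return False      # return does not match the pending call
            stack.pop()
        else:
            return False          # unknown event kind
    return True

ok_trace = [("call", "load"), ("call", "open"), ("ret", "open"),
            ("call", "read"), ("ret", "read"), ("ret", "load")]
bad_edge = [("call", "load"), ("call", "write")]
bad_return = [("call", "load"), ("call", "open"), ("ret", "load")]

print(accepts(ok_trace), accepts(bad_edge), accepts(bad_return))
```

The stack is what matches each return to its call site. A model without it would have to accept `bad_return`, since every individual edge in that trace is legal.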
16Better Models and Better Monitoring
- We are interested not just in detecting intrusions, but also in generating high-level explanations of why an application deviates from its intended behavior.
- CFG and Dyck models are all over-approximations of the application's behavior (potential attacks are only discovered when the application behavior deviates from the model).
- We will use the runs of the application to generate under-approximations of the application's behavior!
- Alternatively, every model representing an over-approximation has a dual that represents an under-approximation (over- and under-approximations don't have to be the same type of model!).
- We will combine over- and under-approximations to reduce the risk of missing possible attacks.
- We will refine the over- and under-approximations to improve the application model.
17Combining over and under approximations
Over-approximation (constructed by static analysis)
Under-approximation (constructed from runs)
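One way to picture the combination is a three-way classification of observed behavior. The sketch below is an assumption-laden illustration: both models are reduced to sets of consecutive call pairs, and the names `under_from_runs` and `classify` are hypothetical.

```python
def under_from_runs(runs):
    """Under-approximation: consecutive call pairs actually seen in recorded runs."""
    return {pair for run in runs for pair in zip(run, run[1:])}

def classify(transition, over, under):
    """Place one observed transition relative to the two models."""
    if transition not in over:
        return "violation"    # impossible according to the static model
    if transition in under:
        return "observed"     # allowed and already seen in recorded runs
    return "unexplored"       # allowed statically, never seen in a run

# Hypothetical over-approximation, as static analysis might produce it.
over = {("open", "read"), ("read", "close"), ("open", "write"), ("write", "close")}
under = under_from_runs([["open", "read", "close"]])

print(classify(("open", "read"), over, under))
print(classify(("open", "write"), over, under))
print(classify(("read", "exec"), over, under))
```

The "unexplored" band between the two models is where refinement would focus: more runs grow the under-approximation, and better static analysis shrinks the over-approximation.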
18What if we don't have a model of the application?
- We can monitor the application as a black box and intercept system calls
- Learn a model of good behaviors
- Learn a model of bad behaviors
- Anomalies are differences between good and bad behaviors
- Borrow from delta-debugging techniques to find root causes of misbehaviors
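A minimal sketch of the delta-debugging idea follows, using a simplified variant of the ddmin algorithm. The `fails` oracle is hypothetical; in practice it might replay a candidate event sequence and check whether the misbehavior reappears.

```python
def ddmin(events, fails):
    """Shrink `events` to a smaller list on which `fails` still returns True."""
    assert fails(events)
    n = 2  # current granularity: number of chunks to split into
    while len(events) >= 2:
        chunk = max(1, len(events) // n)
        subsets = [events[i:i + chunk] for i in range(0, len(events), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [e for j, s in enumerate(subsets) if j != i for e in s]
            if fails(subset):          # one chunk alone reproduces the failure
                events, n, reduced = subset, 2, True
                break
            if fails(complement):      # the failure survives removing this chunk
                events, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(events):
                break                  # already at single-event granularity
            n = min(len(events), n * 2)
    return events

# Hypothetical intercepted trace and failure condition.
trace = ["open", "read", "regset", "connect", "send", "close"]
fails = lambda t: "regset" in t and "send" in t

root_cause = ddmin(trace, fails)
print(root_cause)
```

The result is minimal in the sense that removing any single remaining event makes the failure disappear, which makes it a reasonable candidate for a root-cause explanation.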
19Configuration-based Detection, Diagnosis,
Recovery, and Situational Awareness
20Importance of Configuration
- Static configuration state is highly correlated with system behavior
- Many attacks/bugs/errors are introduced by way of a substantive change to configuration
- "A central problem in system administration is the construction of a secure and scalable scheme for maintaining configuration integrity of a computer system over the short term, while allowing configuration to evolve gradually over the long term" (Mark Burgess, author of cfengine)
21AC Opportunity
- Leverage scale of population to learn what are bad states in configuration space
Today: Every configuration change is an uncontrolled experiment
AC Future: Configuration changes managed as controlled, reversible trials
22Live Monitoring of Configuration State
- State analysis
- Comparative diagnosis
- Vulnerability assessment
- Clustering similar nodes and contextualizing observations
- Detect change events
- Cluster low-level changes into transactions
- Log events for problem detection, mitigation, and user interaction
- Share events in real time for situational awareness
- Active learning
- Automated experiments to isolate root causes
- Managed testing of official changes like patch installation
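Clustering low-level changes into transactions could, for instance, be done by grouping events that occur close together in time. The sketch below assumes that heuristic; the `max_gap` threshold, the event keys, and the `cluster_changes` helper are illustrative choices, not the project's method.

```python
def cluster_changes(events, max_gap=2.0):
    """Group (timestamp, key) change events into transactions.

    A new transaction starts whenever more than `max_gap` seconds pass
    since the previous change.
    """
    transactions = []
    for ts, key in sorted(events):
        if transactions and ts - transactions[-1][-1][0] <= max_gap:
            transactions[-1].append((ts, key))
        else:
            transactions.append([(ts, key)])
    return transactions

# Hypothetical registry and filesystem change events.
changes = [
    (0.0, "file:app.dll"),
    (0.5, "reg:AppPath"),
    (1.2, "reg:Run"),
    (30.0, "file:hosts"),
]

transactions = cluster_changes(changes)
print([len(t) for t in transactions])
```

Treating a burst of changes as one unit is what makes it possible to log, share, or reverse an install as a whole rather than key by key.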
23Live Control of Configuration State
- Modification for reversibility and experimentation
- Coarse-grained: VM rollback
- Medium-grained: installer/uninstaller activation
- Fine-grained: direct manipulation of low-level state elements
- Prevention
- In-progress detection of changes
- Interruption of change sequence
- Reversal of partial effects
24Identifying Badness
- Objective Deterministic Criteria
- Rootkit detection from structural features
- Published attack signatures
- Objective Heuristic Criteria
- Performance outside of normal parameters
- Subjective End-User Report
- Dialog with user to gather info, e.g. temporal data for failure appearance
- Administrative Policy
- Rules specified by administrators within community
25Local Components
[Figure: local components. Labels: Community; App VM, VERNIER VM, Experimental VM; COTS; Console (UI), Comm, Diag; App 1, App 2; Agent; VERNIER Monitor/Control; App OS; VERNIER OS Base; VMM (VM Kernel); interface markers 1, 2, 3.]
26Key Interfaces
Interface 1: VERNIER-Agent (TCP/IP, XML?)
- Registry change events
- Filesystem change events
- Install events
- Manipulate registry
- Manipulate filesystem
- Control System Restore
Interface 2: VERNIER-VMM (?)
- Suspend
- Resume
- Checkpoint
- Revert
- Clone
- Reset
- Lock memory
- Process events
- Read memory
- Read/write disk
Interface 3: VERNIER-Community (?)
- Cluster management
- Experience reports
- Unknown
- Prevalent
- Known Bad
- Presumed Good
- State exchange
- Experiment request/response
27Local Functions
[Figure: local functions. Components: Network Tap, Communication Manager, Console, Response Controller, Analysis/Diagnosis, Configuration Analysis, Agent Inside, Event Stream, Behavior Analysis, Traffic Analysis, VMM, Firewall. Local DB contents: local condition detail, event logs, labeled condition signatures, state snapshots, experimental data.]
28Adapting and Extending Host-based, Run-time Win32
Bot Detection for VERNIER
29Exploit botnet characteristic: ongoing command and control
- Network-based approaches
- Filtering (protocol, port, host, content-based)
- Look for traffic patterns (e.g. DynDNS Dagon)
- Hard (encrypt traffic, permute to look like normal traffic, ...): botwriters control the arena.
- Host-based approaches
- Ours: have more info at host level.
- Since the bot is controlled externally, use this meta-level behavioral signature as basis of detection
30Our approach
- Look at the syscalls made by a program
- In particular, at certain of their args: our sinks
- Possible sources for these sinks:
- local: mouse, keyboard, file I/O, ...
- remote: network I/O
- An instance of external control occurs when data from a remote source reaches a sink
- Surprisingly, works really well: for all bots tested (ago, dsnx, evil, g-sys, sd, spy), every command that exhibited external control was detected
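The source-to-sink rule can be sketched as simple taint propagation over an event log. Everything here is an assumption made for illustration: the event encoding, the buffer names, and the `detect_external_control` helper are hypothetical, and a real system would track taint at the byte level inside the monitored process.

```python
def detect_external_control(events):
    """Return names of syscalls whose monitored argument came from the network."""
    tainted = set()   # buffers currently holding remotely sourced data
    alerts = []
    for event in events:
        op = event[0]
        if op == "recv":          # ("recv", buf): remote source, network I/O
            tainted.add(event[1])
        elif op == "input":       # ("input", buf): local source, e.g. keyboard
            tainted.discard(event[1])
        elif op == "copy":        # ("copy", src, dst): taint follows the data
            _, src, dst = event
            if src in tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)
        elif op == "syscall":     # ("syscall", name, arg_buf): arg is a sink
            _, name, arg = event
            if arg in tainted:
                alerts.append(name)
    return alerts

log = [
    ("recv", "net_buf"),
    ("copy", "net_buf", "path"),
    ("input", "typed"),
    ("syscall", "CreateFile", "typed"),   # locally sourced argument
    ("syscall", "DeleteFile", "path"),    # remotely sourced argument
]

alerts = detect_external_control(log)
print(alerts)
```

Only the call whose argument traces back to network input is flagged, which is the external-control signature described above.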
31Big picture
32Design
33Two modes
- Cause-and-effect semantics
- Tight relationship between receipt of some data over network and subsequent use of some portion of that data in a sink
- Correlative semantics: looser relationship
- Use of some data that is the same as some data received over the network
- Why necessary?
34Behaviors ideally disjoint at lowest level in call stack
35Correlative semantics
- Why necessary?
- Why: bots with C library functions statically linked in; unconstrained OOB copies
- In general, almost as good as cause-and-effect semantics (static vs. dynamic linking)
- Exceptions: cmds that format recvd params (e.g. via sprintf)
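Correlative semantics can be pictured as a content match rather than a tracked data flow. The sketch below assumes a whitespace tokenization and a minimum token length; both are illustrative choices, and `correlates` is a hypothetical helper.

```python
def correlates(received, sink_arg, min_len=4):
    """True if a token from any received buffer reappears inside sink_arg."""
    for data in received:
        for token in data.split():
            # Short tokens are skipped to limit accidental matches.
            if len(token) >= min_len and token in sink_arg:
                return True
    return False

# Hypothetical command received over the network, then used verbatim in a sink.
received = [b"PRIVMSG #chan :.download http://example.test/a.exe"]
verbatim_use = correlates(received, b"http://example.test/a.exe")

# A received parameter that is reformatted before use no longer matches:
# 3232235777 is the integer form of 192.168.1.1.
formatted_use = correlates([b".flood 3232235777"], b"192.168.1.1")

print(verbatim_use, formatted_use)
```

The second case shows the exception noted above: once a received parameter has been reformatted, the bytes at the sink no longer equal the bytes received.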
36Benign program testing
- Tested against some benign programs that interact with the network
- Firefox, mIRC, Unreal IRCd
- 3 contextual false positives
- IRCd: sent on X, heard on Y
- Firefox: dereferencing embedded links
- Artificial false positives: quite a few
- mIRC: DCC capabilities
- Firefox: saving contents to a file, ...
37False positives
- contextual false positives: not present in bots
- external control heuristic correctly detected, but these actions under these circumstances are widely accepted as non-malicious
- artificial false positives: not present in bots
- def of external control implies no user input agreeing to particular behavior
- but we don't explicitly track clean data (that received via kb, mouse)
- spurious false positives
- any other incorrect flagging of external control
38Our mechanism review
- Single behavioral meta-signature detects wide variety of behaviors on majority of Win32 bots
- Resilient to differences in implementation
- Resilient in face of unconstrained OOB copies
- Resilient to encryption, w/ some constraints
- Resilient to changes in command-and-control protocol (e.g. from IRC to HTTP) and parameters (e.g. for rendezvous point)
39Knowledge Sharing in VERNIER
40Knowledge Sharing
- Need: communication is the core concept of a community
- Application communities rely on ability to share knowledge: reliable, efficient, authentic, secure
- Approach: two-tier peer-to-peer platform
- Tuple space (ala Linda)
- Considering JavaSpaces implementation of tuple spaces
- Two-tier for better scalability
- If needed, hypercube hashtable index (ala Obreiter and Graf)
- Benefits: reliable, efficient (local) knowledge sharing
- Competition: other possible methods for knowledge sharing include explicit messaging, centralized database, and statically indexed knowledge structures.
- Other approaches lack scalability, are unreliable, and can be difficult to secure
41Knowledge Sharing Levels
- Lower level (within a cluster)
- Tuple space (ala Linda (Gelernter))
- Simple queries
- (, name, ) returns records regarding name
- Concurrent access and update
- Higher level (supernodes)
- Nodes aggregate knowledge of an entire cluster
- Use abstraction to summarize current situation
- Application-level multicast to push out summaries
- Supernode pushes all summary updates into local
tuple space
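A toy version of the lower-level tuple space might look like the sketch below. It is a simplification under stated assumptions: `None` serves as the wildcard field, `rd_all` returns every match without blocking (classic Linda `rd` returns one tuple and may block), and the record layout is invented for the example.

```python
import threading

class TupleSpace:
    """Minimal in-memory tuple space with wildcard queries."""

    def __init__(self):
        self._tuples = []
        self._lock = threading.Lock()   # serializes concurrent access and update

    def out(self, tup):
        """Publish a tuple into the space."""
        with self._lock:
            self._tuples.append(tuple(tup))

    def rd_all(self, pattern):
        """Return all tuples matching pattern; None matches any field."""
        with self._lock:
            return [t for t in self._tuples
                    if len(t) == len(pattern)
                    and all(p is None or p == f for p, f in zip(pattern, t))]

space = TupleSpace()
# Hypothetical records: (node, name, status).
space.out(("node7", "iexplore.exe", "known-bad"))
space.out(("node9", "winword.exe", "presumed-good"))
space.out(("node3", "iexplore.exe", "unknown"))

records = space.rd_all((None, "iexplore.exe", None))
print(records)
```

A supernode pushing summary updates into its local space would use the same `out` call, so cluster members can query summaries and local records through one interface.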
42Group Communication
- Group communication is key
- For higher level, certain usual assumptions
- Reliable delivery
- Ordered message delivery
- Spread (www.spread.org) as a basis for implementation of group communication
- Building on Secure Spread and Progress Software's (progress.com) more secure, reliable, scalable variants of Spread
43Group Communication Security and Privacy
Secrecy and Authenticity
- Security and privacy are critical aspects of VERNIER
- Must authenticate reports and ensure correctness
- Confidentiality of reports
- Protecting user privacy (my files, my keystrokes)
- Protect aspects of applications
- Protect configuration information
- Protect vulnerability detection information
- Community members send status reports to local supernode
- Reports propagated throughout network
44Group Communication Security
- Defense against
- Network attacks sending forged messages to supernodes
- PKI
- Compromised community member sending false reports
- Statistical anomaly detection (e.g. EMERALD)
- Virtualization
- Any report generated within a compromised virtual machine must be consistent with what is observed outside the virtualization layer
45Group Communication Security
- Secure audit logs
- Secure log of all P2P status reports
- Enable post-mortem analysis on detected attacks
- Cryptographic protection of log (Boneh, Waters)
- Sanitizing status reports
- Status reports reveal private information
- Special encryption enabling read only by credentialed members and search (as in search over encrypted database) by community
- Mitigating denial of service attacks on supernodes
- Re-election of supernodes when under attack
- Securing configuration update messages
- PKI authenticating legitimate reports from community members
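To make the audit-log idea concrete, here is a sketch of a tamper-evident log built as a plain SHA-256 hash chain. This is a generic technique chosen for illustration and is not the Boneh/Waters scheme named above; the report fields and helper names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest that precedes the first entry

def _digest(prev, report):
    body = json.dumps(report, sort_keys=True)
    return hashlib.sha256((prev + body).encode()).hexdigest()

def append_entry(log, report):
    """Append a status report, chaining it to the previous entry's digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"report": report, "prev": prev, "digest": _digest(prev, report)})

def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["digest"] != _digest(prev, entry["report"]):
            return False
        prev = entry["digest"]
    return True

audit_log = []
append_entry(audit_log, {"node": "node7", "status": "known-bad"})
append_entry(audit_log, {"node": "node9", "status": "presumed-good"})
intact = verify(audit_log)

audit_log[0]["report"]["status"] = "presumed-good"   # simulated tampering
tampered_ok = verify(audit_log)
print(intact, tampered_ok)
```

A chain like this only detects tampering if the most recent digest is kept somewhere the attacker cannot rewrite, and it provides no confidentiality, so it covers only the post-mortem integrity goal.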
46Schedule, Experimentation, and Evaluation
47Schedule and Milestones
48Experimentation and Evaluation
- Project testbed
- Network of 300 virtual hosts
- 30 server-class physical hosts
- 10 virtual nodes per server
- Three clusters, one at each participant site
- Software
- Host OS: Linux
- Guest (community) OS: Microsoft Windows
- Applications: IE browser (possibly others), MS Office
- Simulations and scalability
- Financially infeasible to scale to thousands of nodes
- Plan is to use hybrid simulation to test scalability
- Real (live) nodes provide actual data
- Simulated nodes use synthesized data generated by perturbing data collected from real clusters/supernodes
49Proposed Success Criteria
- Metrics and targets (team-defined)
- False positives (FP) / False negatives (FN)
- Phase 1: FP < 10%, FN < 20%
- Phase 2: FP < 1%, FN < 2% (order of magnitude improvement)
- Percent loss of network availability
- Phase 1: At most 20% per node, with at most 80% over any 500 ms interval
- Phase 2: At most 5% per node, with at most 20% over any 500 ms interval
- Average time to recovery
- Phase 1: Assuming a fix exists (not a FN), at most 30 minutes to recover the entire community
- Phase 2: At most 10 minutes
- Average network and computational overhead
- No more than 30% slowdown for applications
- No more than 100 KB/s average VERNIER-induced network traffic per node
- Percent accuracy of prediction
- Phase 1: Effects of problems predicted within 15 minutes of onset; set of nodes wrongly predicted (either way) differs by no more than 40% from actual
- Phase 2: Prediction within 5 minutes; predicted set differs by no more than 20%
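The detection and prediction targets could be checked with a few set operations. The sketch below rests on assumptions the slide does not spell out: the thresholds are read as percentages, the FP rate is taken relative to healthy nodes and the FN rate relative to affected nodes, and "wrongly predicted (either way)" is read as the symmetric difference between predicted and actual sets.

```python
def error_rates(flagged, actual, population):
    """Return (FP%, FN%) for sets of flagged and actually affected nodes."""
    healthy = population - actual
    fp = 100.0 * len(flagged & healthy) / len(healthy) if healthy else 0.0
    fn = 100.0 * len(actual - flagged) / len(actual) if actual else 0.0
    return fp, fn

def prediction_error(predicted, actual):
    """Percent of nodes wrongly predicted (either way), relative to the actual set."""
    return 100.0 * len(predicted ^ actual) / len(actual)

# Hypothetical community of 100 nodes, 10 of them actually affected.
population = set(range(100))
actual = set(range(10))
flagged = set(range(9)) | {50, 51, 52}

fp, fn = error_rates(flagged, actual, population)
meets_phase1 = fp < 10 and fn < 20

predicted = set(range(8)) | {20}
pred_err = prediction_error(predicted, actual)
print(round(fp, 2), fn, meets_phase1, pred_err)
```

Other readings of the targets are possible, for example an FP rate measured per alert rather than per healthy node, so the denominators here should be treated as placeholders until the metrics are defined precisely.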