VERNIER: Virtualized Execution Realizing Network Infrastructures Enhancing Reliability

Slides: 50

Provided by: steven237

Transcript and Presenter's Notes

1
VERNIER: Virtualized Execution Realizing Network
Infrastructures Enhancing Reliability

Project Overview
July 2006

2
Background

Commercial-off-the-shelf (COTS) software
Large organizations, including DoD, have become
dependent on it
Yet most COTS software is not dependable enough
for critical applications:
Security breaches
Misconfiguration
Bugs
Large, homogeneous COTS deployments, such as
those in DoD, accentuate the risk, since many
users:
Experience the same failures caused by the same
vulnerabilities, configuration errors, and bugs
Suffer the same costly, adverse consequences
Alternatives, such as government-funded
development of high-assurance systems, present
significant barriers in:
Cost
Functionality
Performance

3
VERNIER Project Objectives

Develop new technologies to deliver the benefits
of scaling techniques to large application
communities
Provide enhanced survivability to the DoD
computing infrastructure
Enhance the cost, functionality, and performance
advantages of COTS computing environments
Investigate and develop new technologies aimed at
enabling communities of systems running similar,
widely available COTS software to perform more
robustly in the face of attacks and software
faults
Deliver a demonstrated, functioning,
transition-ready system that implements these new
application community (AC) survivability technologies
Technical approach: augmented virtual machine
monitor
Commercial transition partner: VMware, Inc.

4
Project Scope

Collaborative detection and diagnosis of failures
Collaborative response to failures
Advanced situational awareness capabilities
Collective understanding of community state
Predictive capability: early warning of potential
future problems
Key goal: turn the size and homogeneity of the
user community into an advantage by converting
scattered deployments of vulnerable COTS systems
into cohesive, survivable application communities
that detect, diagnose, and recover from their own
failures
What COTS?
Microsoft Windows, IE, Office suite, and the like

5
Research Challenges

Extracting behavioral models from binary programs
Breakthrough novel techniques required
Quasi-static state analysis for black-box
binaries
Scaled information sharing
Networked application communities sharing
knowledge about the software they run
Intelligent, comprehensive recovery
Predictive situational awareness
Automatic, easy-to-understand gauges

6
Breakthrough Capabilities
7
Expected Results and Impact

COTS Product (VMware) with breakthrough
capabilities for application communities
Scalability to 100K nodes running augmented
VMware and custom Vernier software
Automatic collaborative failure diagnosis and
recovery
Survivable robust system
Community-aware solution

8
VERNIER Team

SRI International, Menlo Park, CA
Patrick Lincoln, Principal Investigator
Steve Dawson, Project manager; integration
Linda Briesemeister, Knowledge sharing;
collaborative response
Hassen Saidi, Learning-based diagnosis; code
analysis; situation awareness
Stanford University
John Mitchell, Stanford PI; code analysis;
host-based detection and response
Dan Boneh, Knowledge sharing protocols
Mendel Rosenblum, VMM infrastructure;
collaborative response; transition liaison
Alex Aiken, Quasi-static binary analysis
Liz Stinson, Botswat; system security
Palo Alto Research Center (PARC)
Jim Thornton, PARC PI; configuration monitoring
and response; situation awareness
Dirk Balfanz, Community response management
Glenn Durfee, Configuration monitoring and
response; situation awareness
Technology transition partner: VMware, Inc.

9
VERNIER Technical Approach
10
Notional Host System Architecture
11
An Abstraction-Based Diagnosis Capability for
VERNIER
12
Objectives

Based on the general principle that much of security
amounts to making sure
that an application does what it is supposed to
do... and nothing else!
Build models of application behavior (what the
application is supposed to do).
Monitor application behavior and report
malfunctions and unintended behaviors (deviations
from intended behavior).
Use the recorded execution traces as raw data for
a set of abstraction-based diagnosis engines (why
did the deviation from good, intended behavior
occur, to the extent that we can answer such a
question well?).
Share the state of alerts and diagnoses among the
nodes of the community (sharing the bad news, but
also the good news!).
Aggregate the diagnosis outputs and the alerts
into a situation awareness gauge.

13
Approach

We combine a set of well-known, well-established
techniques:
Building increasingly accurate models of
application behavior
Static analysis combined with predicate
abstraction to build Dyck and CFG models used for
static analysis-based intrusion detection
Implement mechanisms for monitoring sequences of
states and actions of an application for the
following purposes:
Check if a known bad sequence is executed
(signature-based!)
Check for previously unknown variations of known
bad sequences (correlation!)
Find root causes for unexpected malfunctions and
malicious exploits (diagnosis)
Diagnosis is performed using techniques borrowed
from:
Delta debugging (root-cause diagnosis)
Anomaly detection (correlation)
The situation awareness gauge is implemented as a
platform-independent web interface

14
Monitoring-Based Diagnosis

We combine these techniques into two phases:
Monitoring: applications are monitored, and
sequences of executions along with configurations
are stored.
Diagnosis: differences between good runs and bad
runs are the first clues used for diagnosis.
Traces of executions are sequences of:
System calls
Method calls
Changes in configurations
The more information is stored, the better the chance
that malfunctions and malicious behaviors are
properly diagnosed.
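
As a rough illustration of the diagnosis phase, the sketch below compares one good trace and one bad trace and reports the events that occur more often in the bad run. The event strings and the name `first_clues` are hypothetical; the slides do not specify a trace format or a differencing method.

```python
from collections import Counter

def first_clues(good_trace, bad_trace):
    """Return events that occur more often in the bad run than in the good run."""
    # Counter subtraction keeps only positive differences.
    extra = Counter(bad_trace) - Counter(good_trace)
    clues = []
    # Report extra events in the order they first appear in the bad run.
    for event in bad_trace:
        if event in extra and event not in clues:
            clues.append(event)
    return clues

# Hypothetical traces mixing system calls and configuration changes.
good_run = ["open:config.ini", "read", "close", "connect:update-server"]
bad_run = ["open:config.ini", "read", "write:registry/Run", "close",
           "connect:unknown-host"]

clues = first_clues(good_run, bad_run)
print(clues)
```

A real diagnosis engine would need many runs and a richer notion of difference; this only shows why storing more of each trace gives the comparison more to work with.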

15
Quasi-static binary analysis and predicate
abstraction-based intrusion detection

Use static analysis to recover the control
flow graph (CFG) of the application.
CFGs are generated by compilers for source code.
Recover the class hierarchy for object code of OO
applications.
Build a pushdown system: a model that
represents an over-approximation of the sequences
of method and system calls of the application.
Deal with context sensitivity to match exit calls
to return locations.
Use predicate abstraction and data flow analysis
to refine the pushdown system and obtain a more
accurate model.
Improve the knowledge about arguments to
monitored calls.
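
The context-sensitivity point can be sketched as a stack-based check: each return must match the most recent open call, and each call must be an edge in the statically recovered call graph. This is a minimal sketch, assuming a toy call graph; `CALL_GRAPH`, `accepts`, and the event format are hypothetical and not taken from the slides.

```python
# Hypothetical call graph: caller -> callees allowed by static analysis.
CALL_GRAPH = {
    "main": {"parse", "send"},
    "parse": {"read"},
    "send": {"write"},
    "read": set(),
    "write": set(),
}

def accepts(trace, entry="main"):
    """Check a trace of ("call", f) / ("ret", f) events against the model.

    Prefixes of valid runs are accepted, since a monitor sees partial runs.
    """
    stack = [entry]
    for kind, func in trace:
        if kind == "call":
            # The call must be an edge of the recovered call graph.
            if func not in CALL_GRAPH.get(stack[-1], set()):
                return False
            stack.append(func)
        elif kind == "ret":
            # Context sensitivity: a return must match the innermost open call.
            if len(stack) <= 1 or stack[-1] != func:
                return False
            stack.pop()
        else:
            return False
    return True

valid = accepts([("call", "parse"), ("call", "read"), ("ret", "read"),
                 ("ret", "parse"), ("call", "send"), ("call", "write"),
                 ("ret", "write"), ("ret", "send")])
bad_edge = accepts([("call", "parse"), ("call", "write")])  # parse never calls write
bad_return = accepts([("call", "parse"), ("call", "read"), ("ret", "parse")])
```

A plain CFG model without the stack would accept `bad_return`, which is the kind of imprecision the pushdown (Dyck) model is meant to remove.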

16
Better Models and Better Monitoring

We are interested not just in detecting
intrusions, but
also in generating high-level explanations of why an
application deviates from its intended behavior.
CFG and Dyck models are all over-approximations
of the application's behavior (potential attacks
are only discovered when the application's behavior
deviates from the model).
We will use the runs of the application to
generate under-approximations of the application's
behavior!
Alternatively, every model representing an
over-approximation has a dual that represents an
under-approximation (over- and under-approximations
don't have to be the same type of model!).
We will combine over- and under-approximations to
reduce the risk of missing possible attacks.
We will refine the over- and under-approximations
to improve the application model.

17
Combining over- and under-approximations
Over-approximation (constructed by static
analysis)
Under-approximation (constructed from runs)
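
One way to picture the combination is as a three-way classification of an observed call sequence. The sketch below represents both models as finite sets of sequences purely for illustration; the models described in the slides are pushdown systems, and the names `classify`, `over_approx`, and `under_approx` are hypothetical.

```python
def classify(sequence, over_approx, under_approx):
    """Place an observed sequence relative to the two models."""
    if sequence not in over_approx:
        # Outside everything static analysis allows: a definite deviation.
        return "alert"
    if sequence in under_approx:
        # Seen in recorded good runs: known behavior.
        return "known-good"
    # Allowed statically but never observed: the gap where attacks can hide.
    return "unverified"

# Hypothetical models over short call sequences.
over_approx = {("open", "read", "close"), ("open", "write", "close"),
               ("open", "close")}
under_approx = {("open", "read", "close")}

results = [classify(s, over_approx, under_approx) for s in [
    ("open", "read", "close"),
    ("open", "write", "close"),
    ("open", "exec", "close"),
]]
print(results)
```

Refining either model shrinks the "unverified" region, which is the risk reduction the slides describe.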
18
What if we don't have a model of the application?

We can monitor the application as a black box and
intercept system calls
Learn a model of good behaviors
Learn a model of bad behaviors
Anomalies are differences between good and bad
behaviors
Borrow from delta-debugging techniques to find
root causes of misbehaviors
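
The slides do not name a learning method. As a stand-in, the sketch below uses sliding-window n-grams over system call traces, a common technique for black-box behavior models, and treats the n-grams seen only in bad runs as the anomalies. The function names and traces are hypothetical.

```python
def ngrams(trace, n=2):
    """All length-n windows of a system call trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def learn(traces, n=2):
    """Learn a model as the union of n-grams over a set of traces."""
    model = set()
    for trace in traces:
        model |= ngrams(trace, n)
    return model

# Hypothetical intercepted system call traces.
good_traces = [["open", "read", "close"],
               ["open", "read", "read", "close"]]
bad_traces = [["open", "read", "exec", "close"]]

good_model = learn(good_traces)
bad_model = learn(bad_traces)

# Anomalies: behavior present in bad runs but absent from good runs.
anomalies = bad_model - good_model
print(sorted(anomalies))
```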

19
Configuration-based Detection, Diagnosis,
Recovery, and Situational Awareness
20
Importance of Configuration

Static configuration state highly correlated with
system behavior
Many attacks/bugs/errors introduced by way of a
substantive change to configuration
"A central problem in system administration is
the construction of a secure and scalable scheme
for maintaining configuration integrity of a
computer system over the short term, while
allowing configuration to evolve gradually over
the long term" (Mark Burgess, author of cfengine)

21
AC Opportunity

Leverage scale of population to learn what are
bad states in configuration space

Today: every configuration change is an
uncontrolled experiment
AC future: configuration changes managed as
controlled, reversible trials
22
Live Monitoring of Configuration State

State analysis
Comparative diagnosis
Vulnerability assessment
Clustering similar nodes and contextualizing
observations
Detect change events
Cluster low-level changes into transactions
Log events for problem detection, mitigation and
user interaction
Share events in real-time for situational
awareness
Active learning
Automated experiments to isolate root causes
Managed testing of official changes like patch
installation
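
To make "cluster low-level changes into transactions" concrete, here is a minimal sketch that groups change events separated by short time gaps. The time-gap heuristic, the `max_gap` threshold, and the event strings are all assumptions; the slides do not say how transactions are identified.

```python
def cluster_into_transactions(events, max_gap=2.0):
    """Group (timestamp, change) events; a gap above max_gap seconds starts a new group."""
    transactions = []
    for ts, change in sorted(events):
        last = transactions[-1] if transactions else None
        if last is not None and ts - last[-1][0] <= max_gap:
            last.append((ts, change))
        else:
            transactions.append([(ts, change)])
    return transactions

# Hypothetical low-level change events (seconds, description).
events = [
    (0.0, "registry: HKLM/Software/Foo created"),
    (0.5, "file: C:/Program Files/Foo/foo.dll written"),
    (1.2, "registry: Run key modified"),
    (30.0, "file: C:/Temp/x.tmp written"),
]

transactions = cluster_into_transactions(events)
sizes = [len(t) for t in transactions]
print(sizes)
```

Here the first three changes would be logged as one transaction (plausibly an install), and the later file write as another.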

23
Live Control of Configuration State

Modification for Reversibility and
Experimentation
Coarse-grained: VM rollback
Medium-grained: Installer/Uninstaller activation
Fine-grained: Direct manipulation of low-level
state elements
Prevention
In-progress detection of changes
Interruption of change sequence
Reversal of partial effects

24
Identifying Badness

Objective Deterministic Criteria
Rootkit detection from structural features
Published attack signatures
Objective Heuristic Criteria
Performance outside of normal parameters
Subjective End-User Report
Dialog with user to gather info, e.g. temporal
data for failure appearance
Administrative Policy
Rules specified by administrators within community

25
Local Components
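[Diagram: an App VM and an Experimental VM, each
running COTS applications (App 1, App 2) and an
Agent on an App OS, alongside a VERNIER VM
(Console (UI), Comm, Diag, VERNIER Monitor/Control
on a VERNIER OS Base), all on the VMM (VM Kernel).
Numbered interfaces: 1 at the Agents, 2 at the
VMM, 3 at the Community.]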
26
Key Interfaces
1: VERNIER-Agent (TCP/IP, XML?)
Registry change events
Filesystem change events
Install events
Manipulate registry
Manipulate filesystem
Control System Restore

2: VERNIER-VMM (?)
Suspend
Resume
Checkpoint
Revert
Clone
Reset
Lock memory
Process events
Read memory
Read/write disk

3: VERNIER-Community (?)
Cluster management
Experience reports
Unknown, Prevalent, Known Bad, Presumed Good
State exchange
Experiment request/response

27
Local Functions
[Diagram components: Network Tap, Communication
Manager, Console, Response Controller, Analysis
Diagnosis, Configuration Analysis, Agent Inside,
Event Stream, Behavior Analysis, Traffic Analysis,
VMM, Firewall]
Local DB: local condition detail, event logs,
labeled condition signatures, state snapshots,
experimental data
28
Adapting and Extending Host-based, Run-time Win32
Bot Detection for VERNIER
29
Exploit botnet characteristic: ongoing command
and control

Network-based approaches
Filtering (protocol, port, host, content-based)
Look for traffic patterns (e.g., DynDNS, Dagon)
Hard (bots can encrypt traffic, permute it to look
like normal traffic, ...): bot writers control the
arena.
Host-based approaches
Ours: more information is available at the host level.
Since the bot is controlled externally, use this
meta-level behavioral signature as the basis of
detection

30
Our approach

Look at the syscalls made by a program
In particular, at certain of their arguments: our
sinks
Possible sources for these sinks:
Local: mouse, keyboard, file I/O, ...
Remote: network I/O
An instance of external control occurs when data
from a remote source reaches a sink
Surprisingly, this works really well: for all bots
tested (ago, dsnx, evil, g-sys, sd, spy), every
command that exhibited external control was
detected
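
The source-to-sink idea can be sketched with a value wrapper that remembers whether its data came from a remote source. This is a toy model, not the actual detector: the `Data` class, the `sink` function, and the example strings are hypothetical, and "CreateProcess" is used only as a label for a monitored call.

```python
class Data:
    """A value plus a flag recording whether it came from a remote source."""

    def __init__(self, value, remote=False):
        self.value = value
        self.remote = remote

    def slice_from(self, start):
        # Data derived from remote data keeps the remote flag.
        return Data(self.value[start:], self.remote)

alerts = []

def sink(syscall_name, arg):
    """Record an instance of external control when remote data reaches a sink."""
    if arg.remote:
        alerts.append((syscall_name, arg.value))

command = Data("exec C:/tmp/payload.exe", remote=True)  # received over the network
typed = Data("C:/Program Files/app.exe", remote=False)  # typed at the keyboard

sink("CreateProcess", command.slice_from(5))  # remote data reaches the sink
sink("CreateProcess", typed)                  # local data: not flagged
print(alerts)
```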

31
Big picture
32
Design
33
Two modes

Cause-and-effect semantics
Tight relationship between receipt of some data
over the network and subsequent use of some portion
of that data in a sink
Correlative semantics: looser relationship
Use of some data that is the same as some data
received over the network
Why necessary?

34
Behaviors: ideally disjoint at the lowest level in
the call stack
35
Correlative semantics

Why necessary?
Bots with C library functions statically
linked in; unconstrained OOB copies
In general, almost as good as cause-and-effect
semantics (static vs. dynamic linking)
Exceptions: commands that format received
parameters (e.g., via sprintf)
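
A minimal sketch of correlative semantics, assuming it amounts to checking whether a sink argument appears verbatim in recently received network data. The function name `correlated`, the `min_len` cutoff, and the IRC-style message are hypothetical. The second check shows the stated exception: once a received parameter has been formatted, the verbatim match is lost.

```python
def correlated(sink_arg, received_buffers, min_len=4):
    """True if sink_arg appears verbatim in any received network buffer."""
    if len(sink_arg) < min_len:
        # Very short arguments would match by coincidence too often.
        return False
    return any(sink_arg in buf for buf in received_buffers)

# Hypothetical data received over the network.
received = ["PRIVMSG #chan :.download http://example.test/a.exe C:/a.exe"]

# Parameter used as received: the match succeeds.
direct = correlated("http://example.test/a.exe", received)

# Parameter reformatted before use (as sprintf would do): the match fails.
formatted = correlated("%s.bak" % "C:/a.exe", received)
print(direct, formatted)
```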

36
Benign program testing

Tested against some benign programs that interact
with the network:
Firefox, mIRC, Unreal IRCd
3 contextual false positives
IRCd: sent on X what was heard on Y
Firefox: dereferencing embedded links
Artificial false positives: quite a few
mIRC: DCC capabilities
Firefox: saving contents to a file, ...

37
False positives

Contextual false positives (not present in bots)
External control heuristic correctly detected, but
these actions under these circumstances are widely
accepted as non-malicious
Artificial false positives (not present in bots)
Definition of external control implies no user input
agreeing to the particular behavior
But we don't explicitly track clean data (that
received via keyboard or mouse)
Spurious false positives
Any other incorrect flagging of external control

38
Our mechanism: review

Single behavioral meta-signature detects wide
variety of behaviors on majority of Win32 bots
Resilient to differences in implementation
Resilient in face of unconstrained OOB copies
Resilient to encryption, with some constraints
Resilient to changes in command-and-control
protocol (e.g. from IRC to HTTP) and parameters
(e.g. for rendezvous point)

39
Knowledge Sharing in VERNIER
40
Knowledge Sharing

Need: communication is the core concept of a
community
Application communities rely on the ability to share
knowledge: reliable, efficient, authentic, secure
Approach: two-tier peer-to-peer platform
Tuple space (à la Linda)
Considering JavaSpaces implementation of tuple
spaces
Two-tier for better scalability
If needed, hypercube hashtable index (à la
Obreiter and Graf)
Benefits: reliable, efficient (local) knowledge
sharing
Competition: other possible methods for knowledge
sharing include explicit messaging, centralized
database, and statically indexed knowledge
structures.
Other approaches lack scalability, are
unreliable, and can be difficult to secure

41
Knowledge Sharing Levels

Lower level (within a cluster)
Tuple space (à la Linda (Gelernter))
Simple queries
(, name, ) returns records regarding name
Concurrent access and update
Higher level (supernodes)
Nodes aggregate knowledge of an entire cluster
Use abstraction to summarize current situation
Application-level multicast to push out summaries
Supernode pushes all summary updates into local
tuple space
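
The lower-level tuple space can be sketched in a few lines. This is an in-memory toy, not JavaSpaces: the class, the `rd_all` method (which returns every match rather than one, unlike Linda's `rd`), the use of `None` as the wildcard, and the record layout are all assumptions made for illustration.

```python
import threading

class TupleSpace:
    """A tiny in-memory tuple space with wildcard matching."""

    def __init__(self):
        self._tuples = []
        self._lock = threading.Lock()  # allows concurrent access and update

    def out(self, tup):
        """Publish a tuple into the space."""
        with self._lock:
            self._tuples.append(tuple(tup))

    def rd_all(self, pattern):
        """Return all tuples matching the pattern; None matches any field."""
        with self._lock:
            return [t for t in self._tuples
                    if len(t) == len(pattern)
                    and all(p is None or p == field
                            for p, field in zip(pattern, t))]

space = TupleSpace()
space.out(("node-17", "winword.exe", "crash"))
space.out(("node-42", "iexplore.exe", "ok"))
space.out(("node-42", "winword.exe", "ok"))

# Simple query by name: wildcards on either side of the name field.
matches = space.rd_all((None, "winword.exe", None))
print(matches)
```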

42
Group Communication

Group communication is key
For higher level, certain usual assumptions
Reliable delivery
Ordered message delivery
Spread (www.spread.org) as a basis for
implementation of group communication
Building on Secure Spread and Progress Software's
(progress.com) more secure, reliable, scalable
variants of Spread

43
Group Communication Security and Privacy
Secrecy and Authenticity

Security and privacy are critical aspects of
VERNIER
Must authenticate reports and ensure correctness
Confidentiality of reports
Protecting user privacy (my files, my keystrokes)
Protect aspects of applications
Protect configuration information
Protect vulnerability detection information
Community members send status reports to local
supernode
Reports propagated throughout network

44
Group Communication Security

Defense against:
Network attacks sending forged messages to
supernodes
PKI
Compromised community member sending false
reports
Statistical anomaly detection (e.g., EMERALD)
Virtualization
Any report generated within a compromised virtual
machine must be consistent with what is observed
outside the virtualization layer

45
Group Communication Security

Secure audit logs
Secure log of all P2P status reports
Enable post-mortem analysis on detected attacks
Cryptographic protection of log (Boneh, Waters)
Sanitizing status reports
Status reports reveal private information
Special encryption enabling reading only by
credentialed members and search (as in search
over an encrypted database) by the community
Mitigating denial-of-service attacks on
supernodes
Re-election of supernodes when under attack
Securing configuration update messages
PKI authenticating legitimate reports from
community members
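
To illustrate the tamper-evidence goal of a secure audit log, the sketch below chains each status report to the previous entry with SHA-256. This is plain hash chaining, swapped in for illustration; it is not the Boneh/Waters construction cited on the slide, and it provides no confidentiality or search. The report strings and function names are hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # placeholder digest preceding the first entry

def append(log, report):
    """Append a report whose digest covers the previous entry's digest."""
    prev = log[-1][1] if log else GENESIS
    digest = hashlib.sha256((prev + report).encode("utf-8")).hexdigest()
    log.append((report, digest))

def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for report, digest in log:
        expected = hashlib.sha256((prev + report).encode("utf-8")).hexdigest()
        if expected != digest:
            return False
        prev = digest
    return True

log = []
for report in ["node-1: OK", "node-2: ALERT registry change", "node-3: OK"]:
    append(log, report)

intact = verify(log)

# An attacker rewrites the alert but cannot fix the stored digest consistently.
log[1] = ("node-2: OK", log[1][1])
tampered = verify(log)
print(intact, tampered)
```

An unkeyed chain like this only detects edits if the latest digest is stored somewhere the attacker cannot reach.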

46
Schedule, Experimentation, and Evaluation
47
Schedule and Milestones
48
Experimentation and Evaluation

Project testbed
Network of 300 virtual hosts
30 server-class physical hosts
10 virtual nodes per server
Three clusters, one at each participant site
Software
Host OS: Linux
Guest (community) OS: Microsoft Windows
Applications: IE browser (possibly others), MS
Office
Simulations and scalability
Financially infeasible to scale to thousands of
nodes
Plan is to use hybrid simulation to test
scalability
Real (live) nodes provide actual data
Simulated nodes use synthesized data generated by
perturbing data collected from real clusters
supernodes

49
Proposed Success Criteria

Metrics and targets (team-defined)
False positives (FP) / False negatives (FN)
Phase 1: FP < 10%, FN < 20%
Phase 2: FP < 1%, FN < 2% (order of magnitude
improvement)
Percent loss of network availability
Phase 1: At most 20% per node, with at most 80%
over any 500 ms interval
Phase 2: At most 5% per node, with at most 20%
over any 500 ms interval
Average time to recovery
Phase 1: Assuming a fix exists (not a FN), at
most 30 minutes to recover the entire community
Phase 2: At most 10 minutes
Average network and computational overhead
No more than 30% slowdown for applications
No more than 100 KB/s average VERNIER-induced
network traffic per node
Percent accuracy of prediction
Phase 1: Effects of problems predicted within 15
minutes of onset; set of nodes wrongly predicted
(either way) differs by no more than 40% of
actual
Phase 2: Prediction within 5 minutes; predicted
set differs by no more than 20%
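
As a worked example of the first metric, the sketch below computes false positive and false negative rates from hypothetical detection counts and checks them against the Phase 1 targets, read here as percentages. The rate definitions (FP over all benign cases, FN over all actual problems) are an assumption; the slides do not define them.

```python
def rates(tp, fp, tn, fn):
    """False positive and false negative rates from raw detection counts."""
    fp_rate = fp / (fp + tn)  # share of benign cases wrongly flagged
    fn_rate = fn / (fn + tp)  # share of real problems missed
    return fp_rate, fn_rate

# Phase 1 targets, read as percentages (an assumption).
PHASE1_MAX_FP = 0.10
PHASE1_MAX_FN = 0.20

# Hypothetical evaluation counts: 100 benign cases, 100 real problems.
fp_rate, fn_rate = rates(tp=85, fp=6, tn=94, fn=15)
meets_phase1 = fp_rate < PHASE1_MAX_FP and fn_rate < PHASE1_MAX_FN
print(fp_rate, fn_rate, meets_phase1)
```

With these counts the rates are 6% and 15%, which would meet Phase 1 but fall well short of the Phase 2 targets.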