Title: LiveOps: Systems Management as a Service
1LiveOps: Systems Management as a Service
- Chad Verbowski, Software Architect, MSR
- Juhan Lee, Xiaogang Liu, Microsoft MSN
- Roussi Roussev, Florida Institute of Technology
- Yi-Min Wang, Microsoft Research
- 12/07/2006 Washington D.C.
2Overview
- Problem Space
- Approach
- Architecture and Challenges
- Deployment and Results
- Tech Transfers, Related and Future Work
3Why Systems Management Is Hard
- Administrators Don't Understand Their Systems
- What processes are valid and should be running?
- Which configuration binary/file/registry does my app need?
- Will this change affect my Line of Business application?
- Management Practices Cannot Be Enforced
- No changes during lockdown periods?
- Consistent updates applied across all machines?
- No copies of source code to removable media?
- Consequences
- Reactive: Problems are detected after they impact reliability
- Expensive: Entropy in System Configuration Creates Complexity
- Makes it harder to troubleshoot, and mistakes more likely
- Harder to secure systems, easier for malware and hackers to hide
- Unreliable: Problems reoccur because the root cause is not found
4Systems Management Surveys
- Reduce Support/Ops Cost, Improve Effectiveness
- More Than 40% of Ops Cost is People
- People Cost Scales Linearly with Managed Servers
- Improve Server Reliability
- 33% of Outages caused by Human Error
- 76% of Time-To-Repair is operator activity
- What changes impact this app and who/what made them?
- The most costly problems are not the most common
- Improve Application Performance
- 33% of operator time is spent on optimization
- Which process is causing the system to be sluggish?
- What resource is hanging my app, and which app has it?
5The Change Management Process
[Diagram labels: Person or Automation; Change Request; Change Tools; OS, Applications, Platform; Change Detected; FDR; LiveOps; Is The Change Approved?]
8The Traditional Approaches
- Signature Based: Accept False (-) for Low False (+)
- Rather than complete analysis, look for known bad
- (AV/AS) Manual Sample Collection and Signature Derivation
- (Mgmt) Manual event rules for well-known problems
- (Requires tuning to specific environments)
- Manifest: Manual Specification of Dependencies
- Coverage for THINLeg Applications is hard
- (Third Party, In-House, and Legacy Applications)
- Resolving Late-Bound dependencies is difficult
- Expensive to create a manifest for large applications
- Keeping the manifest current is challenging
9Our Approach
- Trace All Interactions Between Applications and Configuration
- Interactions Provide Context for Understanding Configuration
- Completeness Enables detection of anomalies we have not seen before
- Comprehensive, always-on, black-box tracing of low-level system activities
- All File, Registry, Process, and Module Load activities
- Average of 20 MB/day, no discernible overhead
- Extensive Cross-Time/Machine Analysis Minimizes False Positives
- Automatically identifies patterns, baselines, and deviations
- 3 seconds per machine-day of data to analyze/process
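The cross-machine part of this analysis can be illustrated with a minimal sketch. It assumes a hypothetical input format of (machine, process name) pairs and a hypothetical threshold; it is not the LiveOps implementation, only one simple way to derive a baseline and flag deviations from it.

```python
from collections import defaultdict

def rare_processes(observations, min_fraction=0.5):
    """Return process names seen on fewer than min_fraction of machines.

    observations: iterable of (machine, process_name) pairs (hypothetical format).
    min_fraction: hypothetical cutoff separating baseline from deviation.
    """
    machines = set()
    seen_on = defaultdict(set)  # process name -> machines where it ran
    for machine, process in observations:
        machines.add(machine)
        seen_on[process.lower()].add(machine)
    total = len(machines)
    return sorted(
        name for name, hosts in seen_on.items()
        if len(hosts) / total < min_fraction
    )

# Illustrative data only: three servers in one property.
observations = [
    ("web01", "w3wp.exe"), ("web02", "w3wp.exe"), ("web03", "w3wp.exe"),
    ("web01", "svchost.exe"), ("web02", "svchost.exe"), ("web03", "svchost.exe"),
    ("web02", "mediaplayer.exe"),
]
print(rare_processes(observations))
```

Processes that run on nearly every machine form the baseline; a process seen on only one of three machines is reported as a deviation worth a closer look.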
10Flight Data Recorder (FDR)
- Airplanes Have Black Box Recorders
- Track Performance Parameters and Cockpit Audio
- Provide A Timeline of Events For Understanding What Happened
- FDR is The Black Box Recorder for Windows
- All File, Registry, Module, Process interactions
- Who, What, When, and How State is Used and Modified
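A trace record that answers who, what, when, and how might look like the sketch below. The field names and the `timeline` helper are assumptions made for illustration; the slides do not describe the actual FDR log format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TraceEvent:
    when: datetime  # time of the interaction
    who: str        # process that performed it
    how: str        # e.g. "read", "write", "delete", "load"
    what: str       # file path, registry key, module, or process touched

def timeline(events, what):
    """Return every event touching one item, oldest first."""
    return sorted((e for e in events if e.what == what), key=lambda e: e.when)

# Illustrative data only.
events = [
    TraceEvent(datetime(2006, 12, 7, 10, 5), "setup.exe", "write", r"C:\app\app.dll"),
    TraceEvent(datetime(2006, 12, 7, 9, 0), "app.exe", "load", r"C:\app\app.dll"),
    TraceEvent(datetime(2006, 12, 7, 9, 30), "app.exe", "read", r"C:\app\app.ini"),
]
for event in timeline(events, r"C:\app\app.dll"):
    print(event.when.isoformat(), event.who, event.how)
```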
12LiveOps Architecture and Data Flow
13Challenges
15FDR Agent Deployment
- Current Deployment: 600 Machines
- 500 Servers from 15 MSN properties
- 107 Corporate Desktops
- 1350 Distinct systems have provided data over 24 months
- Expanding Deployment: 6500 Servers
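A rough sizing of these deployments can be worked out from the figures on the Our Approach slide (20 MB/day of trace data and 3 seconds of analysis per machine-day). The sketch assumes the 20 MB/day figure is per machine, uses decimal gigabytes, and assumes analysis runs sequentially; none of these assumptions is stated in the slides.

```python
MB_PER_MACHINE_DAY = 20               # from the slides; assumed to be per machine
ANALYSIS_SECONDS_PER_MACHINE_DAY = 3  # from the slides

def daily_load(machines):
    """Return (trace volume in GB/day, analysis time in hours/day) for a fleet."""
    gigabytes = machines * MB_PER_MACHINE_DAY / 1000  # decimal GB (assumption)
    hours = machines * ANALYSIS_SECONDS_PER_MACHINE_DAY / 3600
    return gigabytes, hours

for fleet in (600, 6500):
    gb, hours = daily_load(fleet)
    print(f"{fleet} machines: {gb:.0f} GB/day, {hours:.1f} h of analysis per day")
```

Under these assumptions the 6500-server deployment would produce about 130 GB of trace data and need about 5.4 hours of sequential analysis per day.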
16Server Deployment Results
- Critical Changes
- Lockdown Violations: analysis over 12 months
- Most properties had at least 1 violation during each lockdown
- Deleting Server Page File Settings
- Happens every 2-3 months across 10s of servers
- FDR tracked it to a remote registry change, likely from a script
- Daily Changes
- Typically more than 1 change impacts OS and LOB applications
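A lockdown check of the kind summarized above could be sketched as follows. The tuple layouts and the half-open window convention are assumptions for illustration, not the format LiveOps uses.

```python
from datetime import datetime

def lockdown_violations(changes, lockdowns):
    """Return changes whose timestamp falls inside any lockdown window.

    changes:   list of (timestamp, machine, item) tuples (hypothetical format).
    lockdowns: list of (start, end) tuples, treated as start <= t < end.
    """
    return [
        change for change in changes
        if any(start <= change[0] < end for start, end in lockdowns)
    ]

# Illustrative data only.
lockdowns = [(datetime(2006, 11, 20), datetime(2006, 11, 27))]
changes = [
    (datetime(2006, 11, 18, 14, 0), "web01", r"C:\site\web.config"),
    (datetime(2006, 11, 22, 3, 15), "web02", r"HKLM\...\PagingFiles"),
]
for when, machine, item in lockdown_violations(changes, lockdowns):
    print(when.isoformat(), machine, item)
```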
17Server Deployment Results
- Forensics
- 29% of systems have run unauthorized processes
- Email clients, Media Player, Java auto-updating clients
- 8 processes that could not be identified by security experts
- mlconv.exe, monnow.exe, siteremover.exe, lsacacheagent.exe, ...
- Performance
- LOB application reads crypto keys 240 times/second
- Management agent continuously reading all service entries
- Security Best Practices
- 1/3 of systems were found to be running screen savers
- 6 services not running within the machine account context
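Performance findings such as a key being read hundreds of times per second can be surfaced by counting reads per (process, item) pair over a trace window. The sketch below assumes a hypothetical list of read records and a hypothetical threshold; the process and item names are invented for illustration.

```python
from collections import Counter

def hot_readers(reads, window_seconds, threshold_per_second):
    """Return {(process, item): reads per second} for pairs at or above the threshold.

    reads: list of (process, item) pairs observed during the window (hypothetical format).
    """
    counts = Counter(reads)
    return {
        pair: count / window_seconds
        for pair, count in counts.items()
        if count / window_seconds >= threshold_per_second
    }

# Illustrative data only: a 10-second window of read events.
reads = [("lobapp.exe", "crypto-keys")] * 2400 + [("agent.exe", "services")] * 10
print(hot_readers(reads, window_seconds=10, threshold_per_second=100))
```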
18LiveOps Scenarios
- Impact Analysis
- Forensic Investigation
- Enforcing Policies
19Scenario 1: Impact Analysis - Stale Binary after Patch Installation
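One way to reason about this scenario: a process that loaded a module before the patch replaced the file on disk keeps running the old code until it restarts. The sketch below finds such processes from module-load and file-write times; the record layouts are assumptions, and the slide does not describe how LiveOps performs this analysis.

```python
from datetime import datetime

def stale_loads(loads, patched_at):
    """Return (process, module) pairs loaded before the module was last replaced.

    loads:      list of (process, module_path, loaded_at) for processes still running.
    patched_at: dict mapping module_path to the time the file was last written.
    Both layouts are hypothetical.
    """
    return [
        (process, module)
        for process, module, loaded_at in loads
        if module in patched_at and loaded_at < patched_at[module]
    ]

# Illustrative data only.
patched_at = {r"C:\app\crypto.dll": datetime(2006, 12, 1, 2, 0)}
loads = [
    ("w3wp.exe", r"C:\app\crypto.dll", datetime(2006, 11, 28, 8, 0)),   # before patch
    ("report.exe", r"C:\app\crypto.dll", datetime(2006, 12, 1, 9, 0)),  # after patch
    ("w3wp.exe", r"C:\app\other.dll", datetime(2006, 11, 28, 8, 0)),    # not patched
]
print(stale_loads(loads, patched_at))
```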
20FDR Scenario 2: Forensic Investigation - Where did that process / file come from?
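The question in this scenario can be framed as a backward walk through the trace: find what wrote the file, then what started that process, and so on. The sketch below assumes a hypothetical time-ordered list of (actor, action, object) records and follows only the earliest creator at each step; it is an illustration, not the FDR query interface.

```python
def provenance(target, events):
    """Walk backward from target through "write" and "start" events.

    events: time-ordered list of (actor, action, obj) tuples (hypothetical format).
    Returns [target, creator, creator's creator, ...].
    """
    chain = [target]
    seen = {target}
    current = target
    while True:
        creators = [
            actor for actor, action, obj in events
            if obj == current and action in ("write", "start")
        ]
        # Stop when nothing created the current item, or on a cycle.
        if not creators or creators[0] in seen:
            break
        current = creators[0]  # earliest creator only, for simplicity
        seen.add(current)
        chain.append(current)
    return chain

# Illustrative data only.
events = [
    ("explorer.exe", "start", "setup.exe"),
    ("setup.exe", "write", r"C:\tools\helper.exe"),
]
print(provenance(r"C:\tools\helper.exe", events))
```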
21Historical Contextual Information (Drill In)
22Scenario 3: Enforcing Policies - Was that an authorized and planned change?
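Answering this question amounts to matching each detected change against the approved change requests. The sketch below assumes hypothetical record layouts in which a request names a machine, an item, and an approved time window; the slides do not specify how LiveOps represents change requests.

```python
from datetime import datetime

def unapproved_changes(changes, requests):
    """Return detected changes that no approved request covers.

    changes:  list of (timestamp, machine, item) tuples (hypothetical format).
    requests: list of (machine, item, start, end) tuples (hypothetical format).
    """
    def is_approved(change):
        when, machine, item = change
        return any(
            machine == req_machine and item == req_item and start <= when <= end
            for req_machine, req_item, start, end in requests
        )
    return [change for change in changes if not is_approved(change)]

# Illustrative data only.
requests = [
    ("web01", r"C:\site\web.config", datetime(2006, 12, 5, 0, 0), datetime(2006, 12, 5, 6, 0)),
]
changes = [
    (datetime(2006, 12, 5, 1, 30), "web01", r"C:\site\web.config"),  # inside the window
    (datetime(2006, 12, 6, 1, 30), "web01", r"C:\site\web.config"),  # outside the window
    (datetime(2006, 12, 5, 1, 30), "web02", r"C:\site\web.config"),  # wrong machine
]
for when, machine, item in unapproved_changes(changes, requests):
    print(when.isoformat(), machine, item)
```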
23Questions?