Title: System Administration: Drowning in Management Complexity
1System Administration: Drowning in Management Complexity
- Chad Verbowski
- Microsoft Research, Redmond
2Overview
- Problem Space
- Complexity Grows Faster than We Can Handle
- System Management Approaches
- A New Approach: Data Driven Management
- Examples/Results using Data Driven Mgmt
3Motivation
- Systems Management Complexity Scale
- The Amount of Energy We Put into Maintaining Our Systems
- Energy: Software, Hardware, People, Resources
- Complexity is Constantly Growing
- Advances Reducing Development Complexity
- Simplified Development Enables Complex Systems
- Growing Number of Devices, Apps, Users
- Advances Needed in Managing Complexity
- To Avoid Drowning!!!
4Systems Management Problem
- Complexity AND Scale
- Persistent-State Size: O(10^5)
- Persistent-State Access Trace Size: O(10^8) per day
- Number of Programs Interacting: O(10^2)
- Globally
- Number of Machines: O(10^9)
- Each Runs a Different Combination of O(10^6) Programs
- CyberSecurity (Anti-Malware)
- Systems Management Adversaries
- Digital Rights (DRM), Protecting Data
- Cybersecurity: Untrusted Users
5Complexity Comparison With Humans
- Human DNA: 1 Billion Base Pairs, 1 GB
- 0.25% Unique Pairs: 1.2 MB
- 6 Billion People: 7.2 TB
- Encoding of Relatives (3.6:1): 2 TB
- Lempel-Ziv Compression (10:1): 200 GB
- All Living People's DNA Fits on a Laptop!
- Equivalent to 72 Gmail Accounts OR 4 Blu-Ray Discs
- People Ever Born (106,456,367,669): O(10^11)
- Storage Required for All Human DNA: 3.5 TB
- Cost (Using $14k for 18 TB): $2,725
- Backup ($100 for 1 TB Tape): $400
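The chain of estimates on this slide can be replayed as plain arithmetic. The sketch below takes the slide's 7.2 TB starting figure and its stated ratios as given (the variable names are illustrative, not from the talk), and its unrounded results only approximately match the rounded numbers above.

```python
import math

# Figures taken from the slide; the names are illustrative.
RAW_TB_LIVING = 7.2            # unique DNA for 6 billion people, as stated
RELATIVE_ENCODING = 3.6        # 3.6:1 reduction from encoding relatives
LZ_RATIO = 10                  # 10:1 Lempel-Ziv compression
LIVING = 6_000_000_000
EVER_BORN = 106_456_367_669

living_tb = RAW_TB_LIVING / RELATIVE_ENCODING / LZ_RATIO   # about 0.2 TB (200 GB)
ever_born_tb = living_tb * EVER_BORN / LIVING              # about 3.5 TB
disk_cost = ever_born_tb / 18 * 14_000                     # $14k buys 18 TB
tape_cost = math.ceil(ever_born_tb) * 100                  # $100 per 1 TB tape

print(f"living: {living_tb * 1000:.0f} GB, ever born: {ever_born_tb:.2f} TB")
print(f"disk: ${disk_cost:,.0f}, tape backup: ${tape_cost}")
```

The slide appears to round the storage figure to 3.5 TB before pricing it, so the unrounded disk cost here comes out slightly higher.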
6Growth in Software Complexity
- Rate a Developer Can Code
- OO/CORBA/COM Enables Componentization
- Libraries Enable Sharing Code
- New Languages: Less Coding
- Better Tools: Easier to Debug, Build, Annotate
- Density of Developer Collaboration
- Source Control Systems
- Improved Communication (Email, IM)
- Enforceable Software Development Processes
- Hardware Advancements
- Software is a gas that expands to fill its
container
7Developers' Role in Manageability
- Rely on them for Manageable Software?
- Probably Not, at Least Not for a While
- Time to Adopt new platforms, applications, APIs
- Third Party, In House, Legacy Software
- Can They Completely Solve This?
- Not Feeling the Pain of System Administrators
- Manageability is the Top Priority, Right After
- Easy to Manage Hardware?
- Very Few Advancements in this area
- State-of-the-Art: SNMP v1, Circa 1988
8The Software LifeCycle
9The Management LifeCycle
- Software LifeCycle ≠ Management LifeCycle
- ONGOING Cost That Starts With Deployment
- Configuration, Provisioning
- Monitoring, Troubleshooting
- Upgrades, Patching
- Integration with Other Components
- Accumulation of Stuff To Manage
- Applications, Hardware, Devices, Users, and Data
- Ops Cost >> Software Cost
- BIG Trouble Unless Significant Improvement
10Who Is Going to Save Us?
- Sys-Admins Are Ultimately Responsible
- They Understand the Symptoms Best
- Limited Time and Toolset for Fixing Manageability
- Need Better Management Tools (Obviously)
- Their Net Effect Should Not Be More Complexity!
- They Need to Take Virtually No Input or Configuration
- They Should Not Rely On Application Participation
- ARE THESE IMPOSSIBLE CONSTRAINTS??
11Motivation From Albert Einstein
- "Any fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction."
12System Management Techniques: Bad System Management Can Make Things Worse
- Software Development Design Choices
- Componentization is Good
- But Don't Make Every Class a Component!
- Security Checks, and Locks Are Good
- But Don't Unnecessarily Check/Lock At Every Layer!
- System Management Technique Choices
- No Single Technique Solves All Problems
- Be Aware of the Capability and Limitations
- Use Them Appropriately!
13 1. Prescriptive Management: The First Line of Defense
- Limit the Hardware and Software Used
- You can only buy THESE Server/Laptop/Desktop
- Only THESE Versions of App X Are Supported.
- Benefits
- Less Stuff to Manage!
- Challenges
- Ongoing Cost to Maintain the List
- Measuring Compliance is Hard
- Difficult to Clean Up Existing Environments
- User Happiness?
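"Measuring Compliance is Hard" mostly because the inventory is hard to collect; once it exists, the check itself is small. A minimal sketch, assuming a hypothetical approved list and a per-machine inventory of (product, version) pairs:

```python
# Hypothetical approved list: product name -> set of supported versions.
APPROVED = {
    "AppX": {"2.0", "2.1"},
    "MailClient": {"7.5"},
}

def compliance_report(inventory):
    """Check one machine's (product, version) inventory against the list.

    Returns (unapproved_products, unsupported_versions).
    """
    unapproved, unsupported = [], []
    for product, version in inventory:
        if product not in APPROVED:
            unapproved.append(product)
        elif version not in APPROVED[product]:
            unsupported.append((product, version))
    return unapproved, unsupported

machine = [("AppX", "1.9"), ("MailClient", "7.5"), ("GameY", "1.0")]
print(compliance_report(machine))
```

The ongoing cost named above shows up as the need to keep `APPROVED` current for every product and version.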
14 2. Signature Based: Avoid Solving the Same Problem Over and Over and ...
- Create Rules/Fingerprints for Known Problems
- (AV/AS) Manual Sample Collection and Signature Derivation
- (Mgmt) Manual Events Rules for Well Known Problems
- Benefits
- Minimal Troubleshooting Time
- Early Problem Detection
- Challenges
- Costly, Hard to Identify Root Cause
- The Most Costly Issues Frequently Repeat
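As an illustration of rules for well-known problems, the sketch below matches log lines against a small table of hand-written signatures. The signature names, patterns, and remedies are invented for the example; real products derive and distribute theirs differently.

```python
import re

# Hypothetical signatures: (name, compiled pattern, known remedy).
SIGNATURES = [
    ("disk-full", re.compile(r"no space left on device", re.I), "free disk space"),
    ("bad-cert", re.compile(r"certificate (has )?expired", re.I), "renew certificate"),
]

def match_known_problems(log_lines):
    """Return (line_number, signature_name, remedy) for every line a rule matches."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for name, pattern, remedy in SIGNATURES:
            if pattern.search(line):
                hits.append((number, name, remedy))
    return hits

log = [
    "12:00 service started",
    "12:05 write failed: No space left on device",
    "12:06 TLS error: certificate has expired",
]
print(match_known_problems(log))
```

A problem with no entry in `SIGNATURES` produces no hit at all, which is why the approach only pays off for problems that have been seen and diagnosed before.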
15 3. Manifest: Deep System Understanding Enables Policy Based Management
- Complete Description of Environment State
- Each Item's Function is Documented with Dependencies, Valid Values
- Benefits
- Policy Constraints Can be Created and Enforced
- Wide and Deep Knowledge Minimizes Troubleshooting
- Challenges
- Determining What the Policy Should Be
- Virtually Impossible to Create for ALL items
- Third Party, In-House, and Legacy Applications (?)
- Difficulty Resolving Late-Bound Dependencies, Canonicalization Issues
- Costly to Create a Manifest for Large Applications
- Keeping the Manifest Current is Challenging
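A manifest that records valid values and dependencies makes policy checks mechanical. The sketch below assumes a hypothetical two-entry manifest and a flat dictionary of settings per machine; the setting names and the manifest layout are illustrative only.

```python
# Hypothetical manifest: setting name -> allowed values and dependencies.
MANIFEST = {
    "cache.mode": {"valid": {"off", "read", "write"}, "depends_on": []},
    "cache.size_mb": {"valid": set(range(0, 1025)), "depends_on": ["cache.mode"]},
}

def check_policy(state):
    """Return a list of human-readable violations for one machine's state."""
    violations = []
    for name, value in state.items():
        entry = MANIFEST.get(name)
        if entry is None:
            # Items the manifest does not describe cannot be validated.
            violations.append(f"{name}: not described in manifest")
            continue
        if value not in entry["valid"]:
            violations.append(f"{name}: invalid value {value!r}")
        for dep in entry["depends_on"]:
            if dep not in state:
                violations.append(f"{name}: missing dependency {dep}")
    return violations

print(check_policy({"cache.size_mb": 4096, "legacy.flag": 1}))
```

The `legacy.flag` case mirrors the challenge listed above: any item without a manifest entry falls outside what the policy can say anything about.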
16 4. Simplified Management Model: Reduce Complexity by Creating a Simpler Management Abstraction
- Manage a Simplified Logical View
- Complexity is Encapsulated in Components Forming a Logical View
- e.g. A Service Description, and a Service Level Agreement
- Benefits
- The Management Space is Less Complex
- Challenges
- Hard to Define the Right Abstraction for Everyone
- Creating the Model Definition
- Mapping to New Model is Hard
- Equivalence Across Vendor/Application/Version
- Keeping the Real-World and Logical View in Sync
17Motivation For a New Technique: Hard to Solve Real-World Change Management Problems
- My application worked yesterday, but it's not working today. What's the problem?
- My system has been acting weird lately. What has changed?
- If I apply this patch, which of the 3,000 applications in my company may break?
- Was this change consistently applied to all 850 of my servers?
- Some spyware program is hijacking my home page. How can I get rid of it, all of it?
- Are there any Trojan programs hiding on my machine and stealing my bank account passwords?
18Change Management Struggle
(Figure; labels: Applications, App Popularity, App Versions)
19Insights: A Pragmatic Look at Change Management
- Cross Machine State CAN'T Be That Different
- Most of the O(10^9) Systems Are Working Correctly
- Most Environments Have Small Variation in Settings
- System Workloads are Highly Repetitive
- We Only Care About The State That Is Used
- Only 10% of Files and Settings Are Actually Used
- Process / State Interactions Provide Context
- For Understanding Process Dependencies and the State
- We Only Care About New System Changes
- Only 1% of Files and Settings Typically Change
20 5. Data Driven Management: Reduce Complexity Using Automated Monitoring and Analysis
- Manage Only Globally Distinct Differences
- Instrument the OS to Auto Track Process/State Interactions
- Identify New Process Patterns and State Differences
- Benefits
- Simplifies the Troubleshooting Problem Space
- Reduces the Problem Space for Other Techniques
- Leverage Existing Machine Learning Work
- Challenges
- Scalable Low Overhead Data Collection and Analysis
- Determining Cross Machine Equivalence
- False Positives
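One way to read "Manage Only Globally Distinct Differences" is to report, for each machine, only the settings whose value departs from what most other machines use. The sketch below is an assumption about how that could look, not the talk's implementation; the fleet data and the `min_share` threshold are made up.

```python
from collections import Counter

def distinct_differences(fleet, min_share=0.5):
    """Map machine -> settings whose value differs from the fleet-wide majority.

    fleet: {machine_name: {setting_name: value}}
    A setting is reported only when one value is held by more than
    min_share of the machines that have it (an assumed threshold).
    """
    values = {}
    for settings in fleet.values():
        for name, value in settings.items():
            values.setdefault(name, Counter())[value] += 1

    report = {}
    for machine, settings in fleet.items():
        for name, value in settings.items():
            common, count = values[name].most_common(1)[0]
            total = sum(values[name].values())
            if count / total > min_share and value != common:
                report.setdefault(machine, {})[name] = (value, common)
    return report

fleet = {
    "web1": {"tls": "1.2", "proxy": "none"},
    "web2": {"tls": "1.2", "proxy": "none"},
    "web3": {"tls": "1.0", "proxy": "none"},
}
print(distinct_differences(fleet))
```

Settings that legitimately vary per machine (host names, timestamps) would show up here as false positives unless they are canonicalized first, which is the cross-machine equivalence challenge listed above.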
21System Building Challenges
22Data Driven Examples
- Troubleshooting: Strider, PeerPressure
- Spyware Detection: GateKeeper
- Patch Impact Analysis
- Rootkit Detection: GhostBuster
- Exploit Site Discovery: HoneyMonkey
- Closing the Change Mgmt Loop: LiveOps
23Strider Troubleshooter
- My application worked yesterday, but it's not working today. What's the problem?
- Cross-time Diff: O(10^5) → O(10^3)
- Windows XP System Restore Registry snapshot
- Trace the app: O(10^5) → O(10^3)
- Registry read/write operations
- Diff-Trace Intersection: O(10^3) → O(10^1)
- Inverse Change Frequency Ranking
- GeneBank PeerPressure Ranking
- Mostly good Registry snapshots from the Mass for detecting anomalies
- O(10^1) → O(10^0)
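The narrowing steps above can be sketched with set operations. This is an illustration under assumptions, not Strider's code: snapshots are plain dictionaries, the key names use forward slashes for readability, and the change history used for ranking is invented. PeerPressure ranking is not shown.

```python
def diff_trace_intersection(old_snapshot, new_snapshot, traced_keys):
    """Keys that changed between two snapshots AND were touched by the failing app."""
    changed = {
        key
        for key in old_snapshot.keys() | new_snapshot.keys()
        if old_snapshot.get(key) != new_snapshot.get(key)
    }
    return changed & set(traced_keys)

def rank_by_inverse_change_frequency(candidates, change_counts):
    """Order candidates so that rarely changing keys come first.

    change_counts: hypothetical history of how often each key has changed;
    keys that change constantly are less likely to be the root cause.
    """
    return sorted(candidates, key=lambda key: (change_counts.get(key, 0), key))

old = {"HKCU/App/Proxy": "", "HKCU/App/LastRun": "Mon", "HKLM/Other/X": 1}
new = {"HKCU/App/Proxy": "bad:8080", "HKCU/App/LastRun": "Tue", "HKLM/Other/X": 2}
trace = ["HKCU/App/Proxy", "HKCU/App/LastRun", "HKCU/App/Theme"]
history = {"HKCU/App/LastRun": 365, "HKCU/App/Proxy": 1}

suspects = diff_trace_intersection(old, new, trace)
print(rank_by_inverse_change_frequency(suspects, history))
```

In this toy run the diff removes the untouched `Theme` key, the trace removes the unrelated `HKLM/Other/X` change, and the ranking puts the rarely changed proxy setting ahead of the routinely updated `LastRun`.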
24Experimental Results
25AskStrider Auto-Scanner
- My system has been acting weird lately. What has changed?
- Running-module Snapshot: O(10^5) → O(10^3)
- Earliest-Latest Diff: O(10^5) → O(10^3)
- Diff-Snapshot Intersection: O(10^3) → O(10^2)
- Last-Update Timestamp Ranking: O(10^2) → O(10^1)
- Patch Filtering: O(10^1) → O(10^0)
- During patch troubleshooting, focus on files from patches
- During malware troubleshooting, filter out files from patches as noise
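The same funnel can be written as one small function. The sketch assumes snapshots are dictionaries of path to last-update timestamp and that the set of patch-installed files is already known; the function name and its `mode` switch are hypothetical.

```python
def ask_strider_sketch(running_files, earliest, latest, patch_files, mode="malware"):
    """Return recently changed files that are in use, newest first.

    running_files: files backing currently loaded modules
    earliest/latest: {path: last_update_timestamp} from two snapshots
    patch_files: files known to come from patches
    mode: "malware" drops patch files as noise; "patch" keeps only them
    """
    # New or modified files between the two snapshots.
    changed = {p for p in latest if earliest.get(p) != latest[p]}
    candidates = changed & set(running_files)
    if mode == "malware":
        candidates -= set(patch_files)
    elif mode == "patch":
        candidates &= set(patch_files)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(candidates, key=lambda p: latest[p], reverse=True)

earliest = {"a.dll": 100, "b.dll": 100, "c.dll": 100}
latest = {"a.dll": 100, "b.dll": 250, "c.dll": 300, "evil.dll": 400, "idle.dll": 500}
running = ["a.dll", "b.dll", "c.dll", "evil.dll"]
patches = ["c.dll"]

print(ask_strider_sketch(running, earliest, latest, patches, mode="malware"))
print(ask_strider_sketch(running, earliest, latest, patches, mode="patch"))
```

Here `idle.dll` changed but is not loaded, so the intersection with the running-module snapshot removes it before any ranking happens.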
27Patch Impact Analysis
- If I apply this patch, which of the 3,000 applications in my company may break?
- Trace patch installation: O(10^5) → O(10^1)
- Black-box patch manifest
- For each of the O(10^3) apps
- Trace it: O(10^5) → O(10^3)
- Black-box persistent-state app manifest
- Diff-Trace Intersection: O(10^3) → O(10^0) or 0
- Test prioritization: O(10^3) → 0 to O(10^1)
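Once a patch and each application have black-box manifests, impact analysis reduces to intersecting sets. A minimal sketch, assuming manifests are sets of file and setting names (all names below are invented):

```python
def patch_impact(patch_manifest, app_manifests):
    """Rank applications by how much persistent state they share with a patch.

    patch_manifest: files/settings written while tracing the patch install
    app_manifests: {app_name: files/settings the app was traced touching}
    Returns [(app_name, shared_items)] for apps with a non-empty overlap,
    largest overlap first; apps with no overlap are left out.
    """
    patch_items = set(patch_manifest)
    impacted = []
    for app, items in app_manifests.items():
        shared = patch_items & set(items)
        if shared:
            impacted.append((app, sorted(shared)))
    impacted.sort(key=lambda pair: (-len(pair[1]), pair[0]))
    return impacted

patch = {"msxml.dll", "HKLM/Software/Parser/Version"}
apps = {
    "Payroll": {"msxml.dll", "payroll.exe", "HKLM/Software/Parser/Version"},
    "Editor": {"editor.exe", "msxml.dll"},
    "Calculator": {"calc.exe"},
}
print(patch_impact(patch, apps))
```

Ordering by overlap size is only one possible prioritization; weighting by application importance or usage would be an equally reasonable choice.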
28Improving OS Design: What Extensibility Points Exist in the OS?
- Extensibility Point (EP): Configuration Setting Containing the File Name of Code To Be Loaded At Application Runtime
- Used by Malware to Automatically Start After Reboot
- Solution: For Each Module Load, Identify Previously Read Settings Containing the Module Name.
- Results
- 364 Classes of EPs with 7227 EP Instances
- 44% of EP Instances Were Never Modified
- Recommendation: Lock Down
- 70% of EP Instances Were Used by a Single Application
- Recommendation: Removal
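The solution stated above (for each module load, look back for settings that named the module) can be sketched over an ordered event trace. The event tuples below are a hypothetical format, and matching on the lowercased file name is a simplification of whatever canonicalization a real system would need.

```python
import ntpath

def find_extensibility_points(trace):
    """Return {setting_name: set of module names} inferred from one process trace.

    trace: ordered events, either ("read", setting_name, value)
           or ("load", module_path). The event format is hypothetical.
    """
    reads = []      # (setting_name, lowercased value) seen so far
    points = {}
    for event in trace:
        if event[0] == "read":
            _, setting, value = event
            reads.append((setting, str(value).lower()))
        elif event[0] == "load":
            # ntpath splits on backslashes regardless of the host OS.
            module = ntpath.basename(event[1]).lower()
            for setting, value in reads:
                if module in value:
                    points.setdefault(setting, set()).add(module)
    return points

trace = [
    ("read", r"HKLM\Software\Run\Updater", r"C:\Tools\updater.dll"),
    ("read", r"HKCU\App\Theme", "dark"),
    ("load", r"C:\Tools\updater.dll"),
    ("load", r"C:\Windows\System32\kernel32.dll"),
]
print(find_extensibility_points(trace))
```

Only reads that happen before the load are considered, so a setting read after the module is already in memory is not counted as the reason it was loaded.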
29Ghostware: The Ultimate Challenge to Trustworthy Computing
- Ghostware
- Malware programs that patch the OS to hide their files, Registry entries, processes, loaded modules, network ports, etc. from other applications and OS utility programs
- Bad things they can do
- Install keyloggers to steal information
- Use the disks as free storage
- Use the machines to send spam emails
- Release viruses and worms
30CWS spyware detected by Ad-aware
32CWS Spyware Hidden by Hacker Defender
33GhostBuster ScanDiff
34Strider GhostBuster: Ghostware Detector
- Are there any Trojan programs hiding on my machine and stealing my bank account passwords?
- File System and Registry Snapshot: O(10^5)
- Snapshot from a WinPE CD: O(10^5)
- Diff of the two snapshots: O(10^5) → O(10^1)
- Content-Diff Noise Filtering: O(10^1) → O(10^0)
- Only care about files and Registry entries that exist in the second snapshot, but not the first one
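The comparison of the two snapshots is a one-directional set difference. The sketch below assumes each snapshot is a list of paths and treats noise filtering as a known exclusion list, which is a simplification of the content-diff filtering named above; all paths are invented.

```python
def hidden_entries(inside_view, outside_view, noise=()):
    """Entries visible from the clean boot but missing from the live OS view.

    inside_view: paths enumerated from within the running (possibly infected) OS
    outside_view: paths enumerated after booting clean media such as WinPE
    noise: paths expected to differ between the scans (assumed known)
    """
    return sorted(set(outside_view) - set(inside_view) - set(noise))

inside = ["C:/Windows/notepad.exe", "C:/Windows/system.ini"]
outside = [
    "C:/Windows/notepad.exe",
    "C:/Windows/system.ini",
    "C:/Windows/ghost.exe",
    "C:/pagefile.sys",
]
print(hidden_entries(inside, outside, noise=["C:/pagefile.sys"]))
```

Entries that appear only in the inside view are ignored on purpose: hiding software removes things from the live view, so the interesting direction is what the clean scan sees and the live scan does not.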
35LiveOps: Closing The Change Management Loop
(Diagram; labels: Person or Automation, Change Request, Change Tools, OS / Applications / Platform, Change Detected, LiveOps, Is The Change Approved?)
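The loop in the diagram compares what was detected on the systems with what was requested. A minimal sketch of that comparison, assuming both sides can be reduced to (machine, item) pairs; the record format and names are hypothetical and not taken from LiveOps itself.

```python
def audit_changes(detected, approved_requests):
    """Split detected changes into approved and unapproved.

    detected: iterable of (machine, item) changes observed on systems
    approved_requests: iterable of (machine, item) pairs from change requests
    Returns (approved_and_seen, unapproved, approved_but_not_seen).
    """
    detected = list(detected)
    approved = set(approved_requests)
    ok, unapproved = [], []
    for change in detected:
        (ok if change in approved else unapproved).append(change)
    # Approved requests with no matching detected change were never applied.
    missing = sorted(approved - set(detected))
    return ok, unapproved, missing

requests = [("srv1", "patch-101"), ("srv2", "patch-101")]
seen = [("srv1", "patch-101"), ("srv1", "new-admin-account")]
print(audit_changes(seen, requests))
```

The third list addresses the earlier question of whether a change was applied consistently: an approved change that was never detected on `srv2` is as much a finding as an unapproved one on `srv1`.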
36Conclusion
- Think of New Ways to Avoid Complexity
- Don't Just Accept It, Find Better Ways to Manage It
- Invest in Deep Thinking to Advance Ops
- Not just fighting the fires!
- (The Nearest Way to the Exit May be Behind You!)
- To raise new questions, new possibilities, to
regard old problems from a new angle, requires
creative imagination and marks real advance in
science.