Title: GDC Tutorial, 2005. Building Multi-Player Games
1. GDC Tutorial, 2005: Building Multi-Player Games
- Case Study: The Sims Online
- Lessons Learned
- Larry Mellon
2. TSO Overview
- Initial team: little to no MMP experience
- Engineering estimate: switching from 4-8 player peer-to-peer to MMP client/server would take no additional development time!
- No code / architecture / tool support for:
  - Long-term, continually changing nature of game
  - Non-deterministic execution, dual platform (win32 / Linux)
- Overall process designed for single-player complexity, small development team
  - Limited nightly builds, minimal daily testing
  - Limited design reviews, limited scalability testing, no maintainable/extensible implementation requirement
3. TSO Case Study Outline (Lessons Learned)
- Poorly designed SP → MP → MMP transitions
- Scaling
  - Team & code size, data set size
  - Build distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content
4. Scalability (Team Size & Code Size)
- What were the problems
  - Side effects break the ability to work in parallel
  - Limited encapsulation + poor testability + non-determinism = TROUBLE
  - Independent module design & impact on overall system (initially, no system architect)
  - #include structure
    - win32 / Linux, compile times, pre-compiled headers, ...
- What worked
  - Move to new architecture via Refactoring & Scaffolding
    - HSB, incSync, nullView Simulator, nullView client, ...
  - Rolling integrations: never dark
  - Sandboxing & pumpkins
5. Scalability (Build Distribution)
- To developers, customers & fielded servers
- What didn't work (well enough)
  - Pulling builds from developers' workstations
  - Shell scripts & manual publication
- What worked well
  - Heavy automation with web tracking
    - Repeatability, Speed, Visibility
  - Hierarchies of promotion & test
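
A "hierarchy of promotion & test" can be pictured as a ladder of gates: a build only reaches the next audience after it passes that stage's tests, and every step is recorded so it can be shown on a web page. The sketch below is illustrative only. The stage names, the `Build` record and the placeholder tests are assumptions, not TSO's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Build:
    number: int
    stage: str = "dev"                                 # hypothetical starting stage
    history: List[str] = field(default_factory=list)   # audit trail for web tracking


Gate = Tuple[str, Callable[[Build], bool]]


def promote(build: Build, gates: List[Gate]) -> Build:
    """Run each gate in order; stop promoting at the first failing gate."""
    for target_stage, test in gates:
        if not test(build):
            build.history.append(f"blocked before {target_stage}")
            break
        build.stage = target_stage
        build.history.append(f"promoted to {target_stage}")
    return build


# Placeholder tests standing in for smoke, regression and load suites.
gates: List[Gate] = [
    ("qa", lambda b: True),
    ("staging", lambda b: b.number % 2 == 0),  # pretend odd builds fail regression
    ("live", lambda b: True),
]

good = promote(Build(number=42), gates)
bad = promote(Build(number=43), gates)
print(good.stage, good.history)
print(bad.stage, bad.history)
```

Because each promotion is a recorded, repeatable step, "which build is where" becomes a lookup rather than guesswork.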
6. Scalability (Architecture)
- Logical versus physical versus code structure
  - Only physical was not a major, MAJOR issue
- Logical: replicated computing vs client / server
  - Security & stability implications
- Code: client / server isolation & code sharing
  - Multiple, concurrent logic threads were sharing code & data, each impacting the others
  - Nullview client & simulator
- Regulators vs Protocols: bug counts & state machines
7. Go to final architecture ASAP
[Diagram; labels: Multiplayer, Client Sim (repeated once per client), Evolve, "Here be Sync Hell"]
8. Final Architecture ASAP: Make Everything Smaller & Separate
9. Final Architecture ASAP: Reduce Complexity of Branches
- Client & server teams would constantly break each other via changes to shared state & code
[Diagram; labels: Shared Code, Shared State, Packet Arrival, If (client), If (server), #ifdef (nullview), Client Event, Server Event]
10. Final Architecture ASAP: Refactoring
- Decomposed into multiple DLLs
  - Found the Simulator
- Interfaces
- Reference counting
- Client/server subclassing
- How it helped
  - Reduced coupling. Even reduced compile times!
  - Developers in different modules broke each other less often.
  - We went everywhere and learned the code base.
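
Interfaces plus client/server subclassing can replace branching inside shared code: the shared path calls one hook, and each role overrides it. The sketch below is a minimal illustration in Python, not TSO's code. All class and method names are hypothetical, and the nullView variant is assumed here to mean a client that runs the same logic without a view.

```python
from abc import ABC, abstractmethod


class PacketHandler(ABC):
    """Shared code: decoding is common, the reaction is role-specific."""

    def on_packet_arrival(self, raw: bytes) -> str:
        message = raw.decode("utf-8")      # shared decoding step
        return self.handle_event(message)  # role-specific step, no if/else on role

    @abstractmethod
    def handle_event(self, message: str) -> str:
        ...


class ClientHandler(PacketHandler):
    def handle_event(self, message: str) -> str:
        return f"client renders: {message}"


class ServerHandler(PacketHandler):
    def handle_event(self, message: str) -> str:
        return f"server validates: {message}"


class NullViewHandler(ClientHandler):
    """Assumed view-less client variant: same entry point, nothing drawn."""

    def handle_event(self, message: str) -> str:
        return f"nullview records: {message}"


for handler in (ClientHandler(), ServerHandler(), NullViewHandler()):
    print(handler.on_packet_arrival(b"sit on chair"))
```

A change to one role's behavior now touches only that subclass, so the client and server teams no longer edit the same branch-heavy function.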
11. Final Architecture ASAP: It Had to Always Run
- Initially clients wouldn't behave predictably
- We could not even play test
- Game design was demoralized
- We needed a bridge, now!
12. Final Architecture ASAP: Incremental Sync
- A quick temporary solution
  - Couldn't wait for final system to be finished
  - High overhead, couldn't ship it
- We took partial state snapshots on the server and restored to them on the client
- How it helped
  - Could finally see the game as it would be.
  - Allowed parallel game design and coding
  - Bought time to lay in the right stuff.
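
The snapshot-and-restore idea can be sketched roughly as follows. This is an illustration under assumptions, not the incSync implementation: state is modeled as a dictionary mapping object ids to attribute dictionaries, and the names `take_partial_snapshot` and `restore_snapshot` are hypothetical.

```python
import copy
from typing import Dict, Iterable

State = Dict[str, Dict[str, int]]


def take_partial_snapshot(server_state: State, object_ids: Iterable[str]) -> State:
    """Copy only the requested objects from the authoritative server state."""
    return {oid: copy.deepcopy(server_state[oid])
            for oid in object_ids if oid in server_state}


def restore_snapshot(client_state: State, snapshot: State) -> None:
    """Overwrite the client's copies with the server's values, in place."""
    for oid, attrs in snapshot.items():
        client_state[oid] = copy.deepcopy(attrs)


server = {"sim_1": {"hunger": 40, "x": 3}, "chair_7": {"occupied": 1}}
client = {"sim_1": {"hunger": 55, "x": 2}, "chair_7": {"occupied": 1}}  # sim_1 drifted

restore_snapshot(client, take_partial_snapshot(server, ["sim_1"]))
print(client["sim_1"] == server["sim_1"])
```

Copying whole objects is wasteful on the wire, which matches the slide's point that the overhead was too high to ship. The value is that a drifted client is forced back to the server's view.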
13. Architecture Conclusions
- Keep it simple, stupid!
  - Client/server
- Keep it clean
  - DLL/module integration points
  - #ifdefs must die!
- Keep it alive
  - Plan for a constant system architect role: review all modules for impact on team, other modules & extensibility
- Expose & control all inter-process communication
  - See Regulators: state machines that control transactions
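
The slides describe Regulators only as state machines that control transactions. One way to picture that idea is a table of allowed transitions that rejects and logs anything unexpected. The states, message names and `Regulator` class below are hypothetical and are not taken from TSO.

```python
from enum import Enum, auto
from typing import Dict, List, Tuple


class TxState(Enum):
    IDLE = auto()
    REQUESTED = auto()
    CONFIRMED = auto()
    FAILED = auto()


# Allowed transitions; any other (state, message) pair is rejected.
TRANSITIONS: Dict[Tuple[TxState, str], TxState] = {
    (TxState.IDLE, "request"): TxState.REQUESTED,
    (TxState.REQUESTED, "ack"): TxState.CONFIRMED,
    (TxState.REQUESTED, "timeout"): TxState.FAILED,
}


class Regulator:
    def __init__(self) -> None:
        self.state = TxState.IDLE
        self.log: List[str] = []  # every message is recorded, legal or not

    def on_message(self, message: str) -> TxState:
        key = (self.state, message)
        if key not in TRANSITIONS:
            self.log.append(f"rejected {message!r} in {self.state.name}")
            raise ValueError(f"illegal message {message!r} in state {self.state.name}")
        self.state = TRANSITIONS[key]
        self.log.append(f"{message} -> {self.state.name}")
        return self.state


reg = Regulator()
reg.on_message("request")
reg.on_message("ack")
print(reg.state.name, reg.log)
```

Putting every inter-process exchange behind an explicit table like this makes out-of-order or duplicated messages fail loudly at one place instead of corrupting shared state.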
15. Visibility
- Problems
  - Debugging a client/server issue was very slow & painful
  - Knowing what to work on next was largely guesswork
  - Reproducing system failures from live environment
  - Knowing how one build or server cluster differed from another was, again, largely guesswork
- What we did that worked
  - Log / crash aggregators & filters
  - Live critical event monitor
  - Esper: live player & engine metrics
  - Repeatable load testing
  - Web-based Dashboard: health, status, where is everything
  - Fully automated build & publish procedures
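
A log / crash aggregator with a filter can be as simple as collapsing similar messages into one signature and counting them across servers. The sketch below assumes a made-up log line format (`<server> <LEVEL> <message>`) and invented sample lines. It is not the format TSO used.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical log format: "<server> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<server>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def aggregate(lines: Iterable[str], level: str = "CRASH") -> List[Tuple[str, int]]:
    """Count messages at the given level across all servers, most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == level:
            # Replace numbers so "lot 12" and "lot 99" share one signature.
            signature = re.sub(r"\d+", "N", match.group("msg"))
            counts[signature] += 1
    return counts.most_common()


logs = [
    "city01 INFO player joined lot 12",
    "city01 CRASH null object in lot 12",
    "city02 CRASH null object in lot 99",
    "city02 CRASH db timeout after 30 s",
]
print(aggregate(logs))
```

Sorting by frequency turns "what to work on next" from guesswork into a ranked list.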
16. Visibility via "Bread Crumbs": Aggregated Instrumentation Flags Trouble Spots
[Chart; annotation: Server Crash]
17. Quickly Find Trouble Spots
[Chart] DB byte count oscillates out of control, server crashes
18. Drill Down For Details
[Chart] A single DB request is clearly at fault
20. Testability
- Development, release, load: all show-stopper problems
- QA coordination / speed / cost
- Repeatability, non-determinism
- Need for many, many tests per day, each with multiple inputs (two to two thousand players per test)
21. Testability: What Worked
- Automated testing for repeatability & scale
  - Scriptable test clients: mirrored actual user play sessions
  - Changed the game's architecture to increase testability
  - External test harnesses to control 50 test clients per CPU, 4,000 per session
  - Push-button UI to configure, run & analyze tests (developer & QA)
  - Constantly updated Baselines, with Monkey Test stats
  - Pre-checkin regression
- QA: web-driven state machine to control testers & collect/publish results
- What didn't work
  - Event Recorders, unit testing
  - Manual-only testing
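
Baselines and pre-checkin regression fit together naturally: a candidate change is run through the automated tests and its numbers are compared with the last known-good run. The sketch below shows one possible comparison. The metric names, the values and the 10% tolerance are invented for illustration, and all metrics are assumed to be "lower is better".

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float], current: Dict[str, float],
                tolerance: float = 0.10) -> List[str]:
    """Return metrics that are worse than baseline by more than `tolerance`."""
    failed = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            failed.append(f"{name}: missing from current run")
        elif value > base * (1 + tolerance):
            failed.append(f"{name}: {value} vs baseline {base}")
    return failed


# Hypothetical numbers from a baseline run and a candidate change.
baseline = {"login_ms": 200.0, "enter_lot_ms": 900.0, "crashes": 0.0}
current = {"login_ms": 210.0, "enter_lot_ms": 1300.0, "crashes": 0.0}

problems = regressions(baseline, current)
print("BLOCK CHECKIN" if problems else "OK", problems)
```

When a run passes, its numbers can become the new baseline, which is one way to keep baselines "constantly updated".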
22. MMP Automated Testing: Approach
- Push-button ability to run large-scale, repeatable tests
- Cost
  - Hardware / Software
  - Human resources
  - Process changes
- Benefit
  - Accurate, repeatable & measurable tests during development and operations
  - Stable software, faster, measurable progress
  - Base key decisions on fact, not opinion
23. Why Spend The Time & Money?
- System complexity, non-determinism, scale
  - Tests provide hard data in a confusing sea of possibilities
- End users: high Quality of Service bar
- Dev team: greater comfort & confidence
- Tools augment your team's ability to do their jobs
  - Find problems faster
  - Measure / change / measure: repeat as necessary
- Production & executives come to depend on this data to a high degree
24. Scripted Test Clients
- Scripts are emulated play sessions: just like somebody plays the game
  - Command steps: what the player does to the game
  - Validation steps: what the game should do in response
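
The command / validation split can be shown with a toy script runner. Everything here is a stand-in: `FakeGame`, the command names and the step tuple layout are hypothetical, since the slides do not show the real client API or script syntax.

```python
from typing import Any, Dict, List, Tuple


class FakeGame:
    """Toy stand-in for the game client."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {"location": "lobby", "money": 100}

    def execute(self, command: str, arg: Any) -> None:
        if command == "enter_lot":
            self.state["location"] = arg
        elif command == "buy":
            self.state["money"] -= arg
        else:
            raise ValueError(f"unknown command {command!r}")


Step = Tuple[str, str, Any]  # (kind, name, value)

script: List[Step] = [
    ("command", "enter_lot", "lot_42"),   # what the player does to the game
    ("validate", "location", "lot_42"),   # what the game should do in response
    ("command", "buy", 30),
    ("validate", "money", 70),
]


def run_script(game: FakeGame, steps: List[Step]) -> List[str]:
    """Execute command steps and collect a message for each failed validation."""
    failures = []
    for kind, name, value in steps:
        if kind == "command":
            game.execute(name, value)
        elif game.state.get(name) != value:
            failures.append(f"{name}: expected {value!r}, got {game.state.get(name)!r}")
    return failures


print(run_script(FakeGame(), script))
```

The same script can be run by one client for a feature test or by thousands of clients for a load test, because it needs no human and no display.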
25. Scripts Tailored To Each Test Application
- Unit testing: 1 feature = 1 script
- Load testing: representative play session
  - The average Joe, times thousands
- Shipping quality: corner cases, feature completeness
- Integration: test code changes for catastrophic failures
26. Scripted Players: Implementation
[Diagram; labels: Commands, Presentation Layer]
27. Process Shift: Earlier Tools Investment Equals More Gain
[Chart; annotation: Not Good Enough]
28. Process Shifts: Automated Testing Changes The Shape Of The Development Progress Curve
- Stability (Code Base & Servers): keep developers moving forward, not bailing water
- Scale & Feature Completeness: focus developers on key, measurable roadblocks
29. Process Shift: Measurable Targets, Projected Trend Lines
[Chart; labels: Target, Complete, Core Functionality Tests, Any Feature (e.g. clients), Time, Any Time (e.g. Alpha)]
- Actionable progress metrics, early enough to react
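
A projected trend line is just a fit through the measurements so far, extended to the target. The worked example below uses an ordinary least-squares line. The weekly numbers and the target of 1,000 concurrent clients are invented for illustration and are not TSO data.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_target_reached(target: float, xs: List[float], ys: List[float]) -> float:
    """Project the week at which the fitted line crosses the target."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        raise ValueError("no upward trend; the target is never reached")
    return (target - intercept) / slope


# Hypothetical data: clients sustained per server cluster in weekly load tests.
weeks = [1.0, 2.0, 3.0, 4.0]
clients = [100.0, 200.0, 300.0, 400.0]

projected = week_target_reached(1000.0, weeks, clients)
print(f"target projected for week {projected:.1f}")
```

If the projected week falls after the milestone date (e.g. Alpha), the gap is visible early enough to react.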
30. Process Shift: Load Testing (Before Paying Customers Show Up)
- Expose issues that only occur at scale
- Establish hardware requirements
- Establish play is acceptable _at_ scale
31. Client-Server Comparison
33. User Data
- Oops!
  - Users stored much more data (with much more variance) than we had planned for
  - Caused many DB failures, city failures
  - BIG problem: their persistent data has to work, always, across all builds & DB instances
- What helped
  - Regression testing, each build, against live set of user data
- What would have helped more
  - Sanity checks against the DB
  - Range checks against user data
  - Better code & architecture support for validation of user data
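
Range checks against user data can be expressed as a table of limits applied to every record, both when data is written and when each new build is regression-tested against live data. The field names and bounds below are hypothetical. The slides do not list the real ones.

```python
from typing import Dict, List, Tuple

# Hypothetical fields and limits, for illustration only.
LIMITS: Dict[str, Tuple[int, int]] = {
    "money": (0, 10_000_000),
    "objects_on_lot": (0, 500),
    "friends": (0, 1_000),
}


def validate_user_record(record: Dict[str, int]) -> List[str]:
    """Return one message per missing or out-of-range field."""
    errors = []
    for field_name, (low, high) in LIMITS.items():
        if field_name not in record:
            errors.append(f"{field_name}: missing")
        elif not low <= record[field_name] <= high:
            errors.append(f"{field_name}: {record[field_name]} outside [{low}, {high}]")
    return errors


ok_user = {"money": 2_500, "objects_on_lot": 120, "friends": 14}
odd_user = {"money": -50, "objects_on_lot": 9_000}

print(validate_user_record(ok_user))
print(validate_user_record(odd_user))
```

Running a check like this over the whole user database flags the outliers (far more objects or friends than planned for) before they reach a city server.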
34. Patching / New Content / Custom Content
- Oops!
  - Initial patch budget of 1 Meg blown in 1st week of operations
  - New Content required stronger, more predictable process
  - Custom Content required infrastructure able to easily add new content, on the fly
- Key issue: all effort had gone into going Live, not creating a sustainable process once Live
- Conclusion: designing these in would have been much easier than retrofitting
35. Lessons Learned
- autoTest: scripted test clients and instrumented code rock!
  - Collection, aggregation and display of test data is vital in making decisions on a day-to-day basis
  - Lessen the panic
    - ScaleBreak is a very clarifying experience
    - Stable code & servers greatly ease the pain of building an MMP game
    - Hard data (not opinion) is both illuminating and calming
- autoBuild: make it push-button with instant web visibility
  - Use early, use often to get bugs out before going live
- Budget for a strong architect role & a strong design review process for the entire game lifecycle
  - Scalability, testability, patching & new content & long-term persistence are requirements: MUCH cheaper to design in than frantic retrofitting
  - KISS principle is mandatory, as is expecting changes
36. Lessons Learned
- Visibility: tremendous volumes of data require automated collection & summarization
  - Provide drill-down access to details from summary view web pages
- Get some people on board who've been burned before: a lot of TSO's pain could have been easily avoided, but there was little experience with distributed systems & MMP design issues in the early phases of the project
  - Fred Brooks, the 31st programmer
- Strong tools & process pay off for large teams & long-term operations
  - Measure & improve your workspace, constantly
- Non-determinism is painful & unavoidable
  - Minimize impact via explicit design support & use strong, constant calibration to understand it
37. Biggest Wins
- Code Isolation
- Scaffolding
- Tools: Build / Test / Measure, Information Management
- Pre-Checkin Regression / Load Testing
38. Biggest Losses
- Architecture: massively peer-to-peer
- Early lack of tools
- #ifdef across platform / function
- Critical-path dependencies
More details: www.maggotranch.com/MMP (3 TSO Lessons Learned talks)