Title: ChingTsun Chou March 2003 Slide 0
1. Applying Formal Methods to Protocol Specifications and System Architecture
Ching-Tsun Chou
Multi-Processor Architecture, Enterprise Platforms Group
Intel Corporation
2. Disclaimer
- The views expressed in this talk are the presenter's alone and not necessarily those of Intel Corporation
3. Why formal methods?
- Architectural specifications contain complex distributed protocols whose correctness is nontrivial to establish
  - Examples: directory-based cache coherence protocols, forward-progress mechanisms, variations of sliding window protocols, ...
- Goal: get protocol specifications correct before implementations commence
  - The earlier a bug is found, the easier it is to fix it, and the more flexibility there is in possible fixes
  - Formal verification (FV) is a body of powerful techniques for achieving this goal
- Formal modeling promotes clear thinking and minimizes misunderstanding and misinterpretation of specifications
  - In the early stages of protocol design, more bugs are found during formal modeling than by model checking
  - Protocol design and formal modeling should go hand in hand
- Formal modeling produces unambiguous golden models of at least some aspects of complex protocols
  - Executable reference models can be generated from formal models
- Experience shows that formal methods work
  - Already a standard industry practice: Intel, Sun, IBM, Compaq, SGI, ...
  - As this talk hopes to demonstrate
4. Formal vs. simulation-based verification
- Simulation-based verification
  - Check a small fraction of all possible behaviors of a large model
  - Very large and relatively complete model
  - Model need not be simplified or abstracted
  - Only a very small part of the state space is explored
  - Need to generate tests and collect coverage feedback
  => Results only as good as your tests and checkers
- Formal verification
  - Check all possible behaviors of a small model
  - All states are exhaustively explored
  - No tests are needed and coverage is 100%
  - Only very small models can be handled
  - Often need drastic simplification and abstraction
  => Results only as good as your models and properties
Moral: There is no free lunch!
5. Overview of Intel's Scalability Port (SP) architecture
- Designed for mid-range shared memory multiprocessors
- Employ high-speed point-to-point interconnect that provides good scalability for mid-range to high-end systems
  - Shared buses are neither cost-effective nor scalable beyond a limited number of processors due to signaling, thermal, mechanical, and other challenges
- Support flexible system architecture
  - Enable a range from cost-optimal small systems to scalable high-end systems
  - Enable system vendors with proprietary system interconnects and components to use Intel building blocks
- An instance of SP was implemented in Intel's E8870 chipset
6. Overview of SP cache coherence protocol
- Make no assumption whatsoever about the relative timing of events or the ordering of messages
  - Completely asynchronous, event-driven specification
- Directory-based, though the directory
  - is optional (no directory = null directory)
  - may or may not be physically distributed
- A generalization of the invalidation-based MESI protocol, but different caching agents may do MESI-state transitions at different times
- Employ mechanisms to resolve conflicts (collisions) of requests from different requesters to the same cache line in a distributed manner
- Table-based specification with 1,000 rows in all tables
  - >20 transaction types, each of which has a different behavior by itself and can interact with every other transaction type
7. SP cache coherence protocol validation flow
[Flow diagram; only its labels are recoverable]
- Artifacts: protocol specification; Boolean rules; non-table-based code; extracted p-tables; generated p-tables; formal verification model; C reference model
- Checks: "?" (comparison); model checking; simulation
- Outcomes: find easy bugs in protocol spec; find hard bugs in protocol spec; find bugs in implementations
8. Properties verified
- Data consistency
  - If a cache's state is valid (i.e., S, E, or M), then its data is up to date
- Cache and directory state consistency
  - If any cache is in state E or M, the other caches must be in I
  - If a presence bit in the directory is 0, the corresponding cache must be in I
  - If the directory state is I, all presence bits are 0 (and hence all caches are I)
  - If the directory state is S, the caches whose presence bits are 1 are in I or S
  - If the directory state is E, there is exactly one presence bit being 1
- Weak liveness properties: AG EF (cs = CS), for each control state cs and each possible value CS of cs
  - Excellent guard against missing rows in protocol tables and other unexpected cases
  - Detect both global and local deadlocks, but not livelock or starvation
  - Do not rely on fairness assumptions
9. Results of SP cache coherence protocol FV
- An SP cache coherence protocol has >10^33 reachable states for a configuration containing 1 cache-line address, 1 home node, 2 caching nodes, and all >20 transaction types
  - Each property takes (on the average) 45 hours to model-check on a 700 MHz Pentium III Xeon machine with 4 GB of physical memory
- Many interesting bugs were found in successive versions of SP cache coherence protocol by both formal modeling and model checking
  - In fact, more bugs were found by the former than by the latter in the early phase of SP protocol design
  - Not surprisingly, most problems were found when SP was first designed and during major revisions (e.g., when new transaction types were added)
  - But even minor revisions could introduce problems
- Moral: As far as cache coherence protocols are concerned, unaided human reasoning should not be trusted
10. Rule-based table checking flow
[Flow diagram; only its labels are recoverable]
- Artifacts: specification document in word processor (e.g., FrameMaker); specification document in HTML; rules; pre-processed table; post-processed table; generated table
- Steps: convert; extract & flatten; post-process; generate
- Check: "?" (comparison)
11. Why rule-based table checking works
- Tables and rules take two fundamentally different but complementary views
  - Tables are row-centric: an enumeration of cases (row = case)
  - Rules are column-centric: an expression of relationships between columns
  - By comparing the two views against each other, the chance of a bug escaping is minimized
  - Ideally, tables and rules should be constructed by two different persons
- Expression of complex relationships between visible columns is simplified by means of hidden columns
  - Cause-and-effect metaphor: hidden columns are the ultimate but invisible causes of visible columns
  - Hidden columns are hidden by existentially quantifying them away
  - Hidden columns are used to increase further the difference of the two views
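The comparison of the two views can be sketched as follows. This is a hypothetical miniature, not an SP protocol table. The column names, the `hit` hidden column, and the rules are all invented for illustration, and brute-force enumeration stands in for the BDD-based machinery. The rules are a conjunction of implications over enumerated columns. The generated table is every satisfying assignment projected onto the visible columns, which is what existentially quantifying the hidden column amounts to.

```python
from itertools import product

# Enumerated column types; "hit" is a hidden column (all names are hypothetical).
DOMAINS = {
    "state": ["I", "S", "E", "M"],
    "op": ["read", "write"],
    "action": ["none", "fetch", "upgrade"],
    "next": ["I", "S", "E", "M"],
    "hit": [False, True],
}
VISIBLE = ("state", "op", "action", "next")

def implies(a, b):
    return (not a) or b

def rules(r):
    """Column-centric view: relationships between columns, via the hidden cause `hit`."""
    hit_def = r["hit"] == (
        (r["op"] == "read" and r["state"] != "I")
        or (r["op"] == "write" and r["state"] in ("E", "M"))
    )
    return all([
        hit_def,
        implies(r["hit"], r["action"] == "none"),
        implies(not r["hit"] and r["state"] == "I", r["action"] == "fetch"),
        implies(not r["hit"] and r["state"] != "I", r["action"] == "upgrade"),
        implies(r["op"] == "write", r["next"] == "M"),
        implies(r["op"] == "read" and r["hit"], r["next"] == r["state"]),
        implies(r["op"] == "read" and not r["hit"], r["next"] == "S"),
    ])

def generate_table():
    """Enumerate satisfying assignments and project away the hidden column."""
    names = list(DOMAINS)
    rows = set()
    for values in product(*(DOMAINS[n] for n in names)):
        r = dict(zip(names, values))
        if rules(r):
            rows.add(tuple(r[c] for c in VISIBLE))
    return rows

# Row-centric view: a hand-written table with two deliberate mistakes
# (the S/write row is missing, and the E/write row has the wrong next state).
WRITTEN = {
    ("I", "read", "fetch", "S"),
    ("I", "write", "fetch", "M"),
    ("S", "read", "none", "S"),
    ("E", "read", "none", "E"),
    ("E", "write", "none", "E"),
    ("M", "read", "none", "M"),
    ("M", "write", "none", "M"),
}

GENERATED = generate_table()
MISSING = GENERATED - WRITTEN  # rows the rules require but the table lacks
EXTRA = WRITTEN - GENERATED    # rows in the table that the rules forbid
print("missing from table:", sorted(MISSING))
print("not allowed by rules:", sorted(EXTRA))
```

A disagreement does not say which side is wrong. It only flags rows for a human to inspect, and either the table or the rule may need the fix.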
12. Results of rule-based table checking
- Coded boolean rules for SP protocol tables and checked rules and tables against each other
  - Typically dozens of errors were found before tables and rules agreed
  - Most errors were trivial (e.g., typos), but some were more serious (e.g., missing cases or systematic misunderstanding)
- Maintained the agreement between tables and rules over 2 years and tens of major and minor protocol revisions
  - Changing rules to keep up with tables almost never required more effort than changing tables themselves
- Rule-based table checking is our first line of defense, flushes out virtually all easy bugs, and has very low computational overhead
  - It takes <5 minutes to extract and verify by rules all SP protocol tables
- We are not advocating that code review of tables be eliminated
  - Code review is still a must at the beginning
  - We do advocate that insights from code review be captured and codified by rules and re-used later when tables are changed
13. Novel applications of binary decision diagrams
1. Rule-based table generation and checking
  - Boils down to enumerating satisfying truth assignments of boolean expressions over enumerated types
2. Search for minimal deadlock-free wormhole routing scheme
  - A wormhole routing scheme is deadlock-free <=> its channel dependency graph is acyclic <=> the transitive closure of the graph contains no self-loop
  - Hence reducible to BDD fixpoint computation
  - Details in our FMSD paper
3. Search for fault-tolerant link initialization sequences
  - Details in our FMSD paper
- Observations
  - Formal methods thinking leads to new ways of looking at old problems
  - A little BDD goes a long way
    - BDD is an efficient representation of boolean rules (1. above)
    - BDD supports exhaustive search of a solution space (2. and 3. above)
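The deadlock-freedom criterion for wormhole routing (acyclic channel dependency graph, i.e., no self-loop in its transitive closure) can be sketched with explicit sets in place of BDDs. The fixpoint iteration has the same shape as the BDD computation, but the representation here is a plain Python set of edges. The channel names and the two example graphs are hypothetical.

```python
def transitive_closure(edges):
    """Least fixpoint: keep composing edges until no new pair appears."""
    closure = set(edges)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if step <= closure:  # nothing new: fixpoint reached
            return closure
        closure |= step

def is_deadlock_free(channel_dependencies):
    """True if the transitive closure of the dependency graph has no self-loop."""
    return all(src != dst for (src, dst) in transitive_closure(channel_dependencies))

# Hypothetical unidirectional ring: each channel waits on the next one, all the way around.
RING = {("c0", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c0")}
# Same ring with the wrap-around dependency removed (e.g., by a routing restriction).
BROKEN_RING = RING - {("c3", "c0")}

print("ring deadlock-free:", is_deadlock_free(RING))
print("broken ring deadlock-free:", is_deadlock_free(BROKEN_RING))
```

A search for a minimal deadlock-free scheme would wrap a check like `is_deadlock_free` in an enumeration of candidate routing restrictions. That search is not shown here.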
14. Lessons learned
- Formal modeling steered us toward more precise and concrete protocol specifications than we would have written without it
  - Even an abstract formal model requires one to spell out what exactly one means by each protocol structure and action
  - Formal modeling also turned out to be an excellent way to help architects articulate their ideas
- Formal verification gave us much higher confidence in the correctness of our protocol specifications than we would have had without it
  - Certain distributed protocols (e.g., directory-based cache coherence protocols) are too complex for unaided human reasoning alone to get correct
  - Formal verification makes it less risky to modify protocol specifications
- Architecture definition affords a rich and fruitful area for the application of formal methods
  - Avoid state explosion with a high level of abstraction
  - Get to bugs at the earliest possible stage
  - Encourage architects to choose more validation-friendly schemes
- Applying formal methods to a specification enables the exploration of design spaces that are beyond the scope of any particular implementation
  - Especially important for Intel, which defines architectures that will be implemented by multiple vendors over multiple product generations
15. Acknowledgements
- Mani Azimi, Jay Jayasimha, Akhilesh Kumar, Victor W. Lee, Phanindra K. Mannava, Seungjoon Park, and Aniruddha Vaidya all contributed to the work described above.