Title: A Practical Introduction to TALx86
1. A Practical Introduction to TALx86
- CS342
- February 15, 2007
- Spencer Burdette
2. References
- Greg Morrisett, Karl Crary, Neal Glew, Dan Grossman, Richard Samuels, Frederick Smith, David Walker, Stephanie Weirich, and Steve Zdancewic. TALx86: A realistic typed assembly language. In the 1999 ACM SIGPLAN Workshop on Compiler Support for System Software, pages 25-35, Atlanta, GA, USA, May 1999.
- Kedar N. Swadi and Andrew W. Appel. Typed Machine Language and its Semantics.
- Dan Grossman and Greg Morrisett. Scalable Certification for Typed Assembly Language. In the 2000 ACM SIGPLAN Workshop on Types in Compilation, Montreal, Canada, September 2000.
- Numerous Wikipedia articles: http://www.wikipedia.org
- The Cornell TALx86 homepage: http://www.cs.cornell.edu/talc/
3. Prerequisites
- Type theory notation and operations
  - Various notations that formalize programming semantics
  - Polymorphic lambda calculus
  - First-order predicate calculus
- Functional programming languages that lend themselves to theorem proving
  - ML, CAML, OCAML
- High-level language polymorphism constructs
  - ad hoc, subtyping, parametric, etc.
- Assembler operations and optimizations
  - x86, control-flow graphs, tail-call elimination, constant folding, etc.
4. Type Safety
- Type safety is interpreted here more broadly than data type safety alone (e.g. casting a long int to a char).
- A type error is defined as an attempt to perform an operation on some value that is not appropriate to its type.
  - e.g. Setting the program counter to an address found in a local buffer (buffer overflow).
  - Performing an arithmetic operation on uninitialized data.
  - Dereferencing a NULL pointer.
5. Type Safety (cont'd)
- Type safety is closely linked with memory safety.
  - Allowing an arbitrary integer to be used as a pointer violates memory safety.
  - Bounds checking of arrays is required for type safety.
- For the sake of simplicity, assume that most (if not all) of the programmatic flaws discussed thus far in our class can be attributed to type violations.
- Type safety ultimately aims to provide strong guarantees about the runtime behavior of a system.
  - Useful for proof-carrying or certified code deployment.
6. Type Safety Classifications (1 of 4)
- Static
  - Early compile time (semantic analysis)
    - Compiler complains of mismatched types in assignments and other expressions; remedied using explicit casts.
  - Late or post compile time (optimization time)
    - Compiler (or a related tool) constructs data-flow and control-flow graphs and infers types of values from a series of unifications and reductions (Hindley-Milner type inference algorithm).
7. Type Safety Classifications (2 of 4)
- Dynamic
  - Run-time environment assigns or infers types of variables during execution.
- Hybrid
  - Polymorphic behavior of the language requires annotation of certain variables that cannot be resolved to a type during the static phase.
8. Type Safety Classifications (3 of 4)
- Nominative vs. Structural
  - Nominative: types must share the precise name to be compatible.
  - Structural: types must describe values that share the same structure to be compatible.
- Weak vs. Strong
  - The weakly vs. strongly typed distinction is widely regarded as a grey area.
  - Indications that a programming language is weakly typed:
    - Compiler inserts implicit type conversions on behalf of the programmer.
    - Language allows the programmer access to the underlying bit patterns of data types, thus allowing them to bypass type checking.
    - Data types can be cast or used directly for memory access.
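The three indications of weak typing listed above can each be seen in a few lines of C. The following is a minimal sketch, not taken from the slides: the function names are made up for illustration, and `float_bits` assumes that `float` is a 32-bit IEEE 754 value, which the C standard does not guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Implicit conversion: the int is converted to char without any cast. */
char narrow(int value)
{
    char c = value;
    return c;
}

/* Bit-pattern access: copy the bytes of a float into an integer.
   Assumption: float is 32 bits wide and uses IEEE 754 encoding. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Integer used for memory access: the caller supplies a plain number and
   the function dereferences it. Nothing checks that it is a valid address. */
int read_through_integer(uintptr_t address)
{
    const int *p = (const int *)(void *)address;
    return *p;
}
```

None of these functions is rejected by a C compiler, which is the point: a verifier for a type-safe target has to rule out the third one in particular.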
9. Type Safety Classifications (4 of 4)
- The series of type classifications results in a matrix:
  Language    Static/Dynamic  Weak/Strong  Nominative/Structural  Safe?
  assembly    none            strong       structural             no
  C           static          weak         nominative             no
  Java        hybrid          strong       nominative             yes
  JavaScript  dynamic         weak         nominative             yes
  Lisp        dynamic         strong       structural             yes
  ML          static          strong       structural             yes
10. What is TALx86?
- Typed Assembly Language for a subset of the Intel x86 ISA.
- Consists of a RISC-like assembly language and operational semantics for a simple abstract machine.
- A formal type system that captures the possible register, stack, and memory states of a program as well as their transitions.
- Rigorous proofs (well beyond the scope of this presentation, thankfully) have demonstrated that TAL enforces certain safety guarantees.
11. Without the Jargon
- TALx86 is a low-level target language, analogous to Java bytecode, that is intended to support a variety of statically typed, weak or strong source languages.
- Like any good intermediate language, TALx86 has been designed to support common assembly-level optimizations.
- The TALx86 project provides tools for the assembly, disassembly, and linking of TAL binaries.
12. Advantages over JVML
- Semantic errors have been uncovered in the JVML verifier.
  - It has been suggested that if type-soundness theorems had been applied to the JVML during the design phase, more bugs would have been prevented.
- It is difficult to compile high-level languages other than Java to the JVML, since the instructions and types are specifically tailored to Java.
- JIT compilation is used to accelerate performance; however, an error in the JIT compiler can introduce a security hole, since JIT translation occurs after the verification step.
13. Diagram of Process
14. Explanation of Process
- Client receives packaged .tal files, similar to a .jar.
- The TALx86 type verifier and link checker can be run without access to or knowledge of the program source code or compiler.
- To prepare the code for execution, trusted modules are linked in for run-time support, and the memory-management, array-access, and array-update macros are expanded.
- A somewhat optimized, type-safe native machine code binary is produced.
15. Type Annotation Classes
- Import and export interface information.
- Type constructor declarations, for new types and type abbreviations.
- Typing preconditions on code labels. Registers must have specific types before control may enter the associated code.
- Types on data labels, to specify the type of a static data item.
- Typing coercions on instruction operands.
- Macro instructions, used to encapsulate small instruction sequences.
16. Crux of TALx86
- The most important feature of the type checker is the third annotation class, the typing preconditions on code labels.
- General form of annotations:
    ∀[α1:κ1, ..., αn:κn].{r1:τ1, ..., rn:τn}
  - Registers r1 through rn must contain values of types τ1 through τn before control is passed to the corresponding label.
  - The bound type variables α1 ... αn allow types on registers to be polymorphic, by treating them as abstract types.
  - A set of kinds is supported. Each bound variable is labeled with a kind κ, so only appropriate types are used to instantiate the bound type variables.
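To make the idea of a label precondition concrete, here is a minimal sketch in C of the check a verifier performs at a jump: every register that the target label constrains must already hold a value of the required type. The type universe (`T_B4`, `T_CODE`), the register set, and the function `may_enter` are hypothetical simplifications for illustration. Real TALx86 types are far richer, and this sketch does not model the bound type variables or their kinds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified register types. T_ANY means "no requirement"
   in a precondition and "nothing known" in the current state. */
enum reg_type { T_ANY, T_B4, T_CODE };
enum reg { EAX, EBX, ECX, EBP, NUM_REGS };

struct reg_file_type {
    enum reg_type regs[NUM_REGS];
};

/* Control may enter a label only if each constrained register already
   has the required type. Unconstrained registers are ignored. */
bool may_enter(const struct reg_file_type *current,
               const struct reg_file_type *precondition)
{
    assert(current != NULL && precondition != NULL);
    for (int r = 0; r < NUM_REGS; r++) {
        enum reg_type required = precondition->regs[r];
        if (required != T_ANY && current->regs[r] != required)
            return false;
    }
    return true;
}
```

For example, a label that requires a B4 in both eax and ebx can be entered from a state where ecx also holds a B4, but not from a state where nothing is known about ebx.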
17. Reference code snippet
    /* Calculate sum of first n natural numbers. */
    int i = n+1;
    int s = 0;
    while (--i > 0)
        s += i;
- Translated into TALx86 as (assumes n initially resides in ecx):
            mov eax, ecx        ; i = n
            inc eax             ; ++i
            mov ebx, 0          ; s = 0
            jmp test
    body:   {eax: B4, ebx: B4}
            add ebx, eax        ; s += i
    test:   {eax: B4, ebx: B4}
            dec eax             ; --i
            cmp eax, 0          ; i > 0
            jg body
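The correspondence between the C loop and the two TALx86 labels is easier to see when the C code is rewritten with one `goto` label per basic block. This is only an illustrative sketch: the function name `sum_blocks` is made up, and the per-line comments pair each C statement with the instruction it stands for in the snippet above.

```c
#include <assert.h>

/* Same computation as the reference snippet, structured like the assembly. */
int sum_blocks(int n)
{
    int i = n;      /* mov eax, ecx */
    int s;
    assert(n >= 0);
    ++i;            /* inc eax      */
    s = 0;          /* mov ebx, 0   */
    goto test;      /* jmp test     */
body:
    s += i;         /* add ebx, eax */
test:
    --i;            /* dec eax      */
    if (i > 0)      /* cmp eax, 0   */
        goto body;  /* jg body      */
    return s;
}
```

Each label is a point where control can arrive from more than one place, which is why the verifier needs an explicit precondition there: both `i` and `s` must be 4-byte integers on every path into `body` and `test`.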
18. An Actual Function
    int sum (int n) {
        int i = n+1;
        int s = 0;
        while (--i > 0)
            s += i;
        return s;
    }
- Assume the caller places the return address in ebp and expects the return value in eax:
    sum:    {ecx: B4, ebp: {eax: B4}}
            mov eax, ecx        ; i = n
            inc eax             ; ++i
            mov ebx, 0          ; s = 0
            jmp test
    body:   {eax: B4, ebx: B4, ebp: {eax: B4}}
            add ebx, eax        ; s += i
    test:   {eax: B4, ebx: B4, ebp: {eax: B4}}
            dec eax             ; --i
            cmp eax, 0          ; i > 0
            jg body
            mov eax, ebx        ; return value s
            jmp ebp             ; return to caller
19. Problems with Previous Snippet
- Recall from the previous slide that the function sum() required the return address in ebp, the function parameter in ecx, and the return value in eax. This is a very ad hoc approach, far from the standard calling convention.
- The standard C calling convention typically places arguments, the return address, the old base pointer, local variables, and possibly a return value on the stack.
  - Values are not often passed among functions through registers.
- TALx86 is constructed so as not to be bound to a specific calling convention.
20. Stack Layout for a General Purpose Calling Convention
    Caller (int p) { char a; Callee(a, p); }
    Callee (char a, int p) { int i; /* do stuff */ }
[Figure: stack layout, listed from the top of the stack (esp); the diagram also labels ebp]
    Local variables of Callee
    Old Base Pointer
    Return Address
    Parameters for Callee
    Local variables of Caller
    Old Base Pointer
    Return Address
    Parameters for Caller
21. TALx86 Stack Abstraction and Stack Datatypes
- Stack is defined as a list of types.
  - σ is a stack type.
  - se represents an empty stack.
  - τ::σ is a type that describes stacks where the topmost element is of type τ and the rest of the stack is described by σ.
- Example
    {eax: B4}::B4::B4::se
  - A stack type with three elements: a return address expecting a B4 in eax, followed by two B4 values.
- esp can be bound to a stack type with sptr, as in esp: sptr σ.
22. Revisiting our Reference Code
    int sum (int n) {
        /* Function body */
    }
- The above code originally yielded the following TALx86 label:
    sum: {ecx: B4, ebp: {eax: B4}}
  - This represents a non-standard calling convention.
- Using TALx86 stack abstractions and stack types, the sum() label can be rewritten as:
    sum: {esp: sptr {eax: B4}::B4::se}
  - Read as: esp must contain a stack pointer. The top of that stack is a pointer to a section of code requiring a B4 in eax (i.e. the return address), followed by a B4.
- Spot a problem? The rewritten label does not allow for arbitrary stack depths.
23. Supporting Common Calling Conventions (Take 3)
- TALx86 supports stack polymorphism in order to abstract portions of the stack.
- Rewrite as (ρ is a stack type variable):
    sum: {esp: sptr {eax: B4, esp: sptr B4::ρ}::B4::ρ}
- In order to enter the code associated with the sum label:
  - esp must be a stack pointer, and the topmost stack element must point to a section of code (the return address).
  - The code pointed to must require
    - eax to hold a B4 (the return value)
    - esp to hold a stack pointer to a stack whose topmost element is a B4 (the argument n, still on the stack once the return address has been popped)
    - Below that B4, there can be some other stuff, ρ.
  - Following the return address, sum's stack must contain a B4 (the input value n), and some other stuff ρ.
  - Because the same ρ appears in both places, the rest of the stack has the same type on return as on entry.
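The difference between a fixed stack type ending in se and a polymorphic one ending in a stack variable can be sketched as a small matching function in C. Everything here is a hypothetical simplification for illustration: slots are reduced to two tags, a return address carries no precondition of its own, and `stack_matches` only answers whether a concrete stack shape is acceptable for a label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum slot { SLOT_B4, SLOT_RETADDR };   /* hypothetical slot types */
enum tail { TAIL_EMPTY, TAIL_VAR };    /* se, or a stack type variable */

struct stack_type {
    const enum slot *slots;            /* topmost element first */
    size_t count;
    enum tail tail;
};

/* Does the concrete stack type (which must end in se) fit the pattern?
   A pattern ending in se must match exactly. A pattern ending in a
   variable constrains only its own topmost slots and accepts any rest. */
bool stack_matches(const struct stack_type *pattern,
                   const struct stack_type *concrete)
{
    assert(concrete->tail == TAIL_EMPTY);
    if (concrete->count < pattern->count)
        return false;
    if (pattern->tail == TAIL_EMPTY && concrete->count != pattern->count)
        return false;
    for (size_t i = 0; i < pattern->count; i++) {
        if (pattern->slots[i] != concrete->slots[i])
            return false;
    }
    return true;
}
```

With the fixed pattern "return address, B4, se", a caller that has anything else on its stack is rejected. With the variable tail, the same two topmost slots are required but the depth below them is unconstrained, which is what lets sum be called from any depth.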
24. Dynamic Memory
- TALx86 provides an assembly-level macro for heap allocation of data, malloc.
  - malloc allocates memory and places a pointer to the newly allocated space in eax.
- Proper initialization of dynamic memory is critical to type safety, so a variance is added to each field.
  - e.g. B4^u, B4^r, B4^w, B4^rw (uninitialized, readable, writeable, readable and writeable)
  - Field variances are tracked.
  - Uninitialized data can be written, but not read.
25. Dynamic Memory (cont'd)
- Simple example:
    malloc 4, <B4>
    mov [eax+0], 3
- After the first instruction, the verifier assigns eax a pointer to an uninitialized B4 (field type B4^u).
- After the second instruction, the type of eax becomes a pointer to a readable and writeable B4 (field type B4^rw).
- Naturally, these types can be used as constraints for basic block labels.
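The variance rules can be summarized as a tiny state machine. The sketch below is an assumption-laden model in C, not the verifier's actual logic: the names are made up, and the rules for r and w fields (an r field cannot be written, a w field stays w after a write) are inferred from the variance names rather than stated on the slides. Only the u-to-rw transition comes directly from the example above.

```c
#include <assert.h>
#include <stdbool.h>

enum variance { VAR_U, VAR_R, VAR_W, VAR_RW };   /* u, r, w, rw */

/* A field may be read only when it is known to hold an initialized value. */
bool can_read(enum variance v)
{
    return v == VAR_R || v == VAR_RW;
}

/* Assumption: every variance except read-only permits a write. */
bool can_write(enum variance v)
{
    return v != VAR_R;
}

/* Variance of a field after a write: an uninitialized field becomes
   readable and writeable, as in the malloc example. Others are unchanged. */
enum variance after_write(enum variance v)
{
    assert(can_write(v));
    return v == VAR_U ? VAR_RW : v;
}
```

A verifier built on these rules would reject a load from the freshly allocated field but accept the same load once the store has been seen on every path leading to it.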
26. Arrays
- Array sizes and indices cannot always be determined statically; however, a type-safe language must guarantee that any index is at least 0 and less than the physical size of the array.
- TALx86 introduces macros to handle array subscripting and updating: asub and aupd, respectively.
- Two type constructors are also provided:
  - S(s), where s is some constant or an abstract value, α.
  - array(s, τ^v)
    - s: a type expression representing the size of the array.
    - τ: type of element.
    - v: variance (one of u, r, w, or rw)
27. Array Example
- Simple example: increment an element (here, index 2) of a 5-integer array.
    lab:    {eax: array(5, B4^rw), ebx: S(5)}
            mov ecx, 2
            ; put eax[ecx] into edx; array size in ebx, B4 size is 4
            asub edx, eax, 4, ecx, ebx
            inc edx
            ; put edx into eax[ecx]; array size in ebx, B4 size is 4
            aupd eax, 4, ecx, edx, ebx
- Clearly only works for arrays of size 5. Here comes the abstract notation:
    ∀[s:Sint].{eax: array(s, B4^rw), ebx: S(s)}
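The effect of the subscript and update macros can be approximated in C by functions that take the array size alongside the pointer and refuse any out-of-range index. This is a sketch under stated assumptions: the names `checked_sub`, `checked_upd`, and `increment_element` are made up, and returning `false` on a failed check is an illustrative choice, since the slides do not say what the expanded macros do when the check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Checked subscript: copy array[index] into *out, or fail without
   touching memory when the index is out of range. */
bool checked_sub(const int32_t *array, size_t size, size_t index, int32_t *out)
{
    assert(array != NULL && out != NULL);
    if (index >= size)
        return false;
    *out = array[index];
    return true;
}

/* Checked update: store value into array[index] under the same check. */
bool checked_upd(int32_t *array, size_t size, size_t index, int32_t value)
{
    assert(array != NULL);
    if (index >= size)
        return false;
    array[index] = value;
    return true;
}

/* The slide's sequence: subscript, increment, update. */
bool increment_element(int32_t *array, size_t size, size_t index)
{
    int32_t element;
    if (!checked_sub(array, size, index, &element))
        return false;
    return checked_upd(array, size, index, element + 1);
}
```

Passing the size as a separate argument plays the role of the S(s) register in the label: the check is only meaningful because the size value is tied to the array it describes.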
28. Link Verification
- Link Verifier
  - Ensures that each xxx.tal file makes valid assumptions about the values and types it shares with other modules, using the corresponding xxx_i.tali (import) and xxx_e.tali (export) interface files.
  - In addition to verifying that there are no missing or multiple symbol definitions, the TALx86 linker checks that the files agree on the types of shared values.
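The three link-time checks named above (missing definitions, multiple definitions, and disagreement on types) can be sketched in C as a lookup of one imported symbol against a list of exports. The data layout is hypothetical: types are represented as plain strings and compared textually, which is a stand-in for the real type-equality check, and the names `symbol`, `link_result`, and `check_import` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct symbol {
    const char *name;
    const char *type;   /* assumption: types compared as text */
};

enum link_result { LINK_OK, LINK_MISSING, LINK_MULTIPLE, LINK_TYPE_MISMATCH };

/* Check one import against all exports visible at link time. */
enum link_result check_import(const struct symbol *import,
                              const struct symbol *exports, size_t n_exports)
{
    const struct symbol *found = NULL;
    assert(import != NULL);
    for (size_t i = 0; i < n_exports; i++) {
        if (strcmp(exports[i].name, import->name) != 0)
            continue;
        if (found != NULL)
            return LINK_MULTIPLE;      /* defined more than once */
        found = &exports[i];
    }
    if (found == NULL)
        return LINK_MISSING;           /* no definition at all */
    if (strcmp(found->type, import->type) != 0)
        return LINK_TYPE_MISMATCH;     /* modules disagree on the type */
    return LINK_OK;
}
```

A conventional linker stops after the first two checks. The third is what keeps a module from importing a value at a type other than the one it was verified against.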
29. Type Verification
- Type Verification
  - Using the given annotations, performs type verification using a type inference algorithm.
    - Algorithm succeeds if an assignment of types can be generated that preserves the label preconditions while progressing through each of the possible states.
    - If the algorithm gets stuck, the code is not type safe.
  - Optimization methods are performed (constant folding, common subexpression elimination, tail-call elimination, etc.) to collapse code size and minimize runtime overhead.
    - Anything that is proven to be known at assembly time is flattened. Run-time checks are preserved elsewhere.
30. Benefits
- Language design has been supported by rigorously proven type-soundness theorems.
- Target language is intended to be more generic than JVML, thus allowing a wider range of source languages to compile toward it.
- Ideally, assembly-level optimizations will yield higher performance than JVML.
31. Limitations
- Although the researchers claim that a type-unsafe source language (such as C) can be compiled to TALx86, the restrictions of the target language contradict this claim.
  - TALx86 specifically forbids pointer arithmetic, the address operator, and pointer casts, since compiling these features safely would impose a significant performance penalty. (As we've witnessed in previous discussions!)
- Memory management type variance (u, r, w, rw) tracking does not support aliasing.
- No floating point handling.
- Essentially, only a subset of higher-level source language capabilities is provided, and these are mapped to only a subset of the x86 ISA.
32. Discussion