A Practical Introduction to TALx86

1
A Practical Introduction to TALx86

CS342
February 15, 2007
Spencer Burdette

2
References

Greg Morrisett, Karl Crary, Neal Glew, Dan Grossman, Richard Samuels, Frederick Smith, David Walker, Stephanie Weirich, and Steve Zdancewic. TALx86: A realistic typed assembly language. In the 1999 ACM SIGPLAN Workshop on Compiler Support for System Software, pages 25-35, Atlanta, GA, USA, May 1999.
Kedar N. Swadi and Andrew W. Appel. Typed Machine Language and its Semantics.
Dan Grossman and Greg Morrisett. Scalable Certification for Typed Assembly Language. In the 2000 ACM SIGPLAN Workshop on Types in Compilation, Montreal, Canada, September 2000.
Numerous Wikipedia articles: http://www.wikipedia.org
The Cornell TALx86 homepage: http://www.cs.cornell.edu/talc/

3
Prerequisites

Type theory notation and operations
Various notations that formalize programming semantics
Polymorphic lambda calculus
First-order predicate calculus
Functional programming languages that lend themselves to theorem proving: ML, Caml, OCaml
High-level language polymorphism constructs: ad-hoc, subtyping, parametric, etc.
Assembler operations and optimizations: x86, control-flow graphs, tail-call elimination, constant folding, etc.

4
Type Safety

A broader notion than mere data type safety (e.g. casting a long int to a char).
A type error is an attempt to perform an operation on a value that is not appropriate to its type, e.g.:
Setting the program counter to an address found in a local buffer (buffer overflow).
Performing an arithmetic operation on uninitialized data.
Dereferencing a NULL pointer.

5
Type Safety (cont'd)

Type safety is closely linked with memory safety.
Allowing an arbitrary integer to be used as a
pointer violates the principles of memory safety.
Bounds checking of arrays is required for type
safety.
For the sake of simplicity, assume that most (if
not all) types of programmatic flaws that have
been discussed thus far in our class can be
attributed to type violations.
Type safety ultimately aims to provide strong
guarantees about the runtime behavior of a
system.
Useful for proof-carrying or certified code
deployment.

6
Type Safety Classifications (1 of 4)

Static
Early compile time (semantic analysis)
The compiler complains of mismatched types in assignments and other expressions; the programmer remedies these with explicit casts.
Late or post compile time (optimization time)
The compiler (or a related tool) constructs data-flow and control-flow graphs and infers the types of values through a series of unifications and reductions (e.g. the Hindley-Milner type inference algorithm).

7
Type Safety Classifications (2 of 4)

Dynamic
Run-time environment assigns or infers types of
variables during execution.
Hybrid
Polymorphic behavior of the language requires run-time annotation of certain variables that cannot be resolved to a type during the static phase.

8
Type Safety Classifications (3 of 4)

Nominative vs. Structural
Nominative: types must share the same name to be compatible.
Structural: types must describe values that share the same structure to be compatible.
Weak vs. Strong
Experts agree that the weakly vs. strongly typed distinction is a grey area.
Indications that a programming language is weakly typed:
The compiler inserts implicit type conversions on behalf of the programmer.
The language gives the programmer access to the underlying bit patterns of data types, allowing type checking to be bypassed.
Data types can be cast or used directly for memory access.
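
The first two indications can be seen in a few lines of C. The sketch below is only an illustration, not part of the original slides: the function names are made up for this example, and the bit-pattern check assumes a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Implicit conversion: before the comparison, a is converted to
   unsigned int and becomes UINT_MAX, so "a < b" evaluates to 0. */
int minus_one_less_than_one(void) {
    int a = -1;
    unsigned int b = 1u;
    return a < b;
}

/* Bit-pattern access: copy the bytes of a float into an integer,
   sidestepping the type of the original value.
   Assumption: float is 32 bits wide. */
uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

The compiler accepts both functions without any explicit cast, which is exactly the behavior a strongly typed language would reject or make explicit.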

9
Type Safety Classifications (4 of 4)

The series of type classifications results in a matrix:

Language     Static/Dynamic   Weak/Strong   Nominative/Structural   Safe?
assembly     none             strong        structural              no
C            static           weak          nominative              no
Java         hybrid           strong        nominative              yes
JavaScript   dynamic          weak          nominative              yes
Lisp         dynamic          strong        structural              yes
ML           static           strong        structural              yes
10
What is TALx86?

Typed Assembly Language for a subset of the Intel
x86 ISA.
Consists of a RISC-like assembly language and
operational semantics for a simple abstract
machine.
A formal type system that captures the possible
register, stack, and memory states of a program
as well as their transitions.
Rigorous proofs (well beyond the scope of this
presentation, thankfully) have demonstrated that
TAL enforces certain safety guarantees.

11
Without the Jargon

TALx86 is a low-level target language, analogous
to Java bytecode, that is intended to support a
variety of statically typed, weak or strong
source languages.
Like any good intermediate language, TALx86 has
been designed to support common assembly-level
optimizations.
The TALx86 project provides tools for the
assembly, disassembly, and linking of TAL
binaries.

12
Advantages over JVML

Semantic errors have been uncovered in the JVML
verifier.
It has been suggested that if type-soundness
theorems had been applied to the JVML during the
design phase, more bugs would have been
prevented.
It is difficult to compile high-level languages
other than Java to the JVML, since the
instructions and types are specifically tailored
to Java.
JIT compilation is used to accelerate performance; however, an error in the JIT compiler can introduce a security hole, since JIT translation occurs after the verification step.

13
Diagram of Process
14
Explanation of Process

Client receives packaged .tal files, similar to a
.jar.
Without access to or knowledge of the program
source code or compiler, the TALx86 type verifier
and link checker can be run.
To prepare the code for execution, trusted
modules are linked in for run time support, and
memory management and array access and update
macros are expanded.
A somewhat optimized, type-safe native machine
code binary is produced.

15
Type Annotation Classes

1. Import and export interface information.
2. Type constructor declarations, for new types and type abbreviations.
3. Typing preconditions on code labels. Registers must have specific types before control may enter the associated code.
4. Types on data labels, to specify the type of a static data item.
5. Typing coercions on instruction operands.
6. Macro instructions, used to encapsulate small instruction sequences.

16
Crux of TALx86

The most important feature of the type checker is the third annotation class, the typing preconditions on code labels.
General form of annotations:
∀a1:k1 ... an:kn. {r1: t1, ..., rn: tn}
Registers r1 through rn must contain values of types t1 through tn before control is passed to the corresponding label.
The bound type variables a1 ... an allow the types on registers to be polymorphic, by treating them as abstract types.
Each bound type variable carries a kind, written k, so that only appropriate types are used to instantiate it.
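
To make the idea of a label precondition concrete, here is a minimal C model of the check a verifier might perform before allowing a jump. It is a sketch under loud assumptions: the type tags, the register set, and the function may_jump are all hypothetical, and the model ignores bound type variables and kinds entirely.

```c
#include <assert.h>

/* Hypothetical type tags; TY_ANY means "unknown" for a register
   and "no requirement" inside a precondition. */
typedef enum { TY_ANY, TY_B4, TY_CODE } reg_type;

/* Hypothetical register set used only by this sketch. */
enum { EAX, EBX, ECX, EBP, NUM_REGS };

/* A label precondition: the type each register must hold on entry. */
typedef struct {
    reg_type required[NUM_REGS];
} precondition;

/* Returns 1 if the current register typing satisfies the label's
   precondition, 0 otherwise. */
int may_jump(const reg_type current[NUM_REGS], const precondition *label) {
    int r;
    for (r = 0; r < NUM_REGS; r++) {
        if (label->required[r] != TY_ANY && label->required[r] != current[r])
            return 0;
    }
    return 1;
}
```

With a precondition requiring B4 in eax and ebx, a jump is accepted only while both registers are known to hold B4 values; registers the precondition does not mention are unconstrained.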

17
Reference code snippet

/* Calculate the sum of the first n natural numbers. */
int i = n + 1;
int s = 0;
while (--i > 0)
    s += i;

Translated into TALx86 as (assumes n initially resides in ecx):

      mov eax, ecx   ; i = n
      inc eax        ; ++i
      mov ebx, 0     ; s = 0
      jmp test
body: {eax: B4, ebx: B4}
      add ebx, eax   ; s += i
test: {eax: B4, ebx: B4}
      dec eax        ; --i
      cmp eax, 0     ; i > 0
      jg  body
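
The correspondence between the loop and its translation is easier to follow when the C code is restructured to use the same labels and jumps. The function below is an illustrative rewrite, not something from the slides; the name sum_like_tal is made up, and n is assumed small enough that n + 1 does not overflow.

```c
#include <assert.h>  /* only needed for the checks in the usage note */

/* The summing loop, written with the same control flow as the
   TALx86 listing: one statement per instruction. */
int sum_like_tal(int n) {
    int i = n;       /* mov eax, ecx */
    int s;
    ++i;             /* inc eax      */
    s = 0;           /* mov ebx, 0   */
    goto test;       /* jmp test     */
body:
    s += i;          /* add ebx, eax */
test:
    --i;             /* dec eax      */
    if (i > 0)       /* cmp eax, 0   */
        goto body;   /* jg body      */
    return s;
}
```

For example, assert(sum_like_tal(4) == 10) holds, since 4 + 3 + 2 + 1 = 10. Each label is a point where the verifier needs a typing precondition, because control can arrive there from more than one place.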

18
An Actual Function

int sum(int n) {
    int i = n + 1;
    int s = 0;
    while (--i > 0)
        s += i;
    return s;
}

Assume the caller places the return address in ebp and expects the return value in eax.

sum:  {ecx: B4, ebp: {eax: B4}}
      mov eax, ecx   ; i = n
      inc eax        ; ++i
      mov ebx, 0     ; s = 0
      jmp test
body: {eax: B4, ebx: B4, ebp: {eax: B4}}
      add ebx, eax   ; s += i
test: {eax: B4, ebx: B4, ebp: {eax: B4}}
      dec eax        ; --i
      cmp eax, 0     ; i > 0
      jg  body
      mov eax, ebx   ; return value s in eax
      jmp ebp        ; return to the caller

19
Problems with Previous Snippet

Recall from the previous slide that sum() required the return address in ebp, the function parameter in ecx, and the return value in eax. This is a very ad-hoc approach, far from the standard calling convention.
The standard C calling convention typically places arguments, the return address, the old base pointer, local variables, and possibly a return value on the stack.
Values are not often passed among functions through registers.
TALx86 is constructed so as not to be bound to a specific calling convention.

20
Stack Layout for a General Purpose Calling Convention

Caller(int p) { char a; Callee(a, p); }
Callee(char a, int p) { int i; /* do stuff */ }

esp
Local variables of Callee
Old Base Pointer
Return Address
Parameters for Callee
Local variables of Caller
Old Base Pointer
Return Address
Parameters for Caller
ebp
21
TALx86 Stack Abstraction and Stack Datatypes

A stack is defined as a list of types.
s is a stack type.
se represents an empty stack.
t::s is a type that describes stacks where the topmost element is of type t and the rest of the stack is described by s.
Example:
{eax: B4}::B4::B4::se
A stack type with three elements: a return address expecting a B4 in eax, followed by two B4 values.
esp can be bound to a stack type with the sptr constructor.

22
Revisiting our Reference Code

int sum(int n) { /* Function body */ }

The above code originally yielded the following TALx86 label:
sum: {ecx: B4, ebp: {eax: B4}}
This represents a non-standard calling convention.
Using TALx86 stack abstractions and stack types, the sum() label can be rewritten as:
sum: {esp: sptr {eax: B4}::B4::se}
Read as: esp must contain a stack pointer to a stack whose top element points to a section of code requiring a B4 in eax (i.e. the return address), followed by a B4.
Spot a problem? The rewritten label does not allow for arbitrary stack depths.

23
Supporting Common Calling Conventions (Take 3)

TALx86 supports stack polymorphism in order to abstract portions of the stack.
Rewrite as:
sum: {esp: sptr {eax: B4, esp: sptr B4::r}::B4::r}
In order to enter the code associated with the sum label:
esp must be a stack pointer to a stack whose top element points to a section of code (the return address).
The code pointed to by that return address must require:
eax to hold a B4 (the return value).
esp to hold a stack pointer to a stack whose top element is a B4 (the argument n, still on the stack after the return address is popped).
Following that B4, there can be some other stuff, r.
Following the return address, sum's stack must contain a B4 (the input value n), and some other stuff, r.
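
The effect of the stack variable r can be modelled as prefix matching: the label fixes the top few slots and accepts any stack below them. The C sketch below is a simplified illustration with hypothetical names (slot_type, stack_type, matches_prefix). It does not model the second half of the real constraint, namely that the return address must expect the same r.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slot tags: a 4-byte value or a return address. */
typedef enum { SLOT_B4, SLOT_RETADDR } slot_type;

/* A concrete stack type: slots listed from the top of the stack down. */
typedef struct {
    const slot_type *slots;
    size_t depth;
} stack_type;

/* Checks a stack against the pattern prefix[0]::...::prefix[n-1]::r,
   where r stands for any remaining stack, including the empty stack se.
   Returns 1 on a match, 0 otherwise. */
int matches_prefix(stack_type st, const slot_type *prefix, size_t n) {
    size_t i;
    if (st.depth < n)
        return 0;
    for (i = 0; i < n; i++) {
        if (st.slots[i] != prefix[i])
            return 0;
    }
    return 1;
}
```

A pattern of a return address followed by a B4 then accepts both a two-slot stack and a much deeper one, which is what lets sum be called from any stack depth.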

24
Dynamic Memory

TALx86 provides an assembly-level macro for heap allocation of data, malloc.
malloc allocates memory and places a pointer to the newly allocated space in eax.
Proper initialization of dynamic memory is critical to type safety, so a variance is added to each field:
e.g. B4^u (uninitialized), B4^r (readable), B4^w (writable), B4^rw (readable and writable)
Field variances are tracked by the verifier.
Uninitialized data can be written, but not read.
25
Dynamic Memory (cont'd)

Simple example:
malloc 4, <[:B4]>
mov [eax+0], 3
After the first instruction, the verifier assigns eax the type ^*[B4^u], a pointer to an uninitialized B4.
After the second instruction, the type of eax becomes ^*[B4^rw], a pointer to a readable and writable B4.
Naturally, these types can be used as constraints for basic block labels.
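
The variance rules can be summarized as a small state machine. The C sketch below is an illustration only: the names are hypothetical, and it assumes that r means read-only, w means write-only, and that writing to an uninitialized field turns it into a readable and writable one, as in the malloc example.

```c
#include <assert.h>

/* Field variances: uninitialized, readable, writable, read-write. */
typedef enum { VAR_U, VAR_R, VAR_W, VAR_RW } variance;

/* A field may be read only once it is known to be initialized. */
int can_read(variance v) {
    return v == VAR_R || v == VAR_RW;
}

/* Uninitialized fields may be written; read-only fields may not. */
int can_write(variance v) {
    return v == VAR_U || v == VAR_W || v == VAR_RW;
}

/* Variance of a field after a store. Assumption: initializing a
   u field makes it rw; other writable variances are unchanged. */
variance after_write(variance v) {
    assert(can_write(v));
    return v == VAR_U ? VAR_RW : v;
}
```

In this model a load from a freshly allocated field is rejected because can_read(VAR_U) is 0, while the same load is accepted after a store has moved the field to VAR_RW.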

26
Arrays

Array sizes and indices cannot always be determined statically; however, a type-safe language must guarantee that every index is at least 0 and less than the size of the array.
TALx86 introduces macros to handle array subscripting and updating: asub and aupd, respectively.
Two type constructors are also provided:
S(s): the type of an integer whose value is exactly s, where s is some constant or an abstract value, a.
array(s, t^v):
s: a type expression representing the size of the array.
t: the type of each element.
v: the variance (one of u, r, w, or rw).

27
Array Example

Simple example:
Increment an element (index 2) of a 5-integer array.

lab: {eax: array(5, B4^rw), ebx: S(5)}
     mov ecx, 2
     ; put eax[ecx] into edx; array size in ebx, B4 size is 4
     asub edx, eax, 4, ecx, ebx
     inc edx
     ; put edx into eax[ecx]; array size in ebx, B4 size is 4
     aupd eax, 4, ecx, edx, ebx

Clearly this only works for arrays of size 5. Here comes the abstract notation:
∀s:Sint. {eax: array(s, B4^rw), ebx: S(s)}
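
In C terms, the subscript and update macros behave like accessors that compare the index against the array size before touching memory. The sketch below is a rough analogue, not the macros' actual expansion: the function names are hypothetical, and reporting failure through a return code is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds-checked read: stores arr[index] in *out and returns 1, or
   returns 0 without touching memory if index is out of range. */
int checked_sub(const int32_t *arr, size_t size, size_t index, int32_t *out) {
    if (index >= size)
        return 0;
    *out = arr[index];
    return 1;
}

/* Bounds-checked write: stores value in arr[index] and returns 1, or
   returns 0 without touching memory if index is out of range. */
int checked_upd(int32_t *arr, size_t size, size_t index, int32_t value) {
    if (index >= size)
        return 0;
    arr[index] = value;
    return 1;
}
```

The size parameter plays the role of ebx in the listing: because its type S(s) ties it to the array's declared size s, the verifier knows the comparison is made against the right number.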

28
Link Verification

Link Verifier
Ensures that each xxx.tal file makes valid assumptions about the values and types it shares with other modules, using the corresponding xxx_i.tali (import) and xxx_e.tali (export) interface files.
In addition to verifying that there are no missing or multiple symbol definitions, the TALx86 linker checks that the files agree on the types of shared values.
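
The checks described above can be sketched as a comparison of import and export tables. This is a toy model in C, not the TALx86 link checker: the symbol struct, the link_ok function, and the representation of types as strings are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface entry: a symbol name and its type,
   with the type written out as text for simplicity. */
typedef struct {
    const char *name;
    const char *type;
} symbol;

/* Returns 1 if every import is defined by exactly one export and the
   two sides agree on the symbol's type; returns 0 otherwise. */
int link_ok(const symbol *imports, size_t n_imports,
            const symbol *exports, size_t n_exports) {
    size_t i, j;
    for (i = 0; i < n_imports; i++) {
        size_t matches = 0;
        for (j = 0; j < n_exports; j++) {
            if (strcmp(imports[i].name, exports[j].name) == 0) {
                if (strcmp(imports[i].type, exports[j].type) != 0)
                    return 0;   /* type disagreement */
                matches++;
            }
        }
        if (matches != 1)
            return 0;           /* missing or multiple definition */
    }
    return 1;
}
```

A conventional linker performs only the name matching; the type comparison is the extra step that keeps a well-typed module from being linked against code that violates its assumptions.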

29
Type Verification

Type Verification
Using the given annotations, performs type verification with a type inference algorithm.
The algorithm succeeds if a set of values can be generated that preserves the label preconditions while progressing through each of the possible states.
If the algorithm gets stuck, the code is not type safe.
Optimization methods are performed (constant folding, common subexpression elimination, tail-call elimination, etc.) to collapse code size and minimize runtime overhead.
Anything that is proven to be known at assembly time is flattened. Run-time checks are preserved elsewhere.

30
Benefits

Language design has been supported by rigorously
proven type-soundness theorems.
Target language is intended to be more generic
than JVML, thus allowing a wider range of source
languages to compile toward it.
Ideally, assembly-level optimizations will yield
higher performance than JVML.

31
Limitations

Although the researchers claim that a type-unsafe source language (such as C) can be compiled to TALx86, the restrictions of the target language contradict this claim.
TALx86 specifically forbids pointer arithmetic, the address operator, and pointer casts, since compiling these features safely would impose a significant performance penalty. (As we've witnessed in previous discussions!)
Memory management type variance (u, r, w, rw) tracking does not support aliasing.
No floating point handling.
Essentially, only a subset of higher-level source language capabilities is provided, and these are mapped only to a subset of the x86 ISA.