Title: Kathleen Fisher
1 PADS: A System for Managing Ad Hoc Data
- Kathleen Fisher
- AT&T Labs Research
- www.padsproj.org
And many many others
2 Kenny Zhu
- Dr. Zhu has been one of the main contributors to the PADS project.
- He is finishing his postdoc at Princeton and looking for jobs, both in North America and Asia.
http://www.cs.princeton.edu/kzhu/
3 Data, Data, Everywhere!
Incredible amounts of data stored in well-behaved
formats
Databases
Tools
- Schema
- Browsers
- Query Languages
- Standards
- Libraries
- Books, documentation
- Training courses
- Conversion tools
- Vendor support
- Consultants...
XML
4 We're not always so lucky!
Vast amounts of chaotic ad hoc data
Tools
5 Web Logs
207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/clear.gif HTTP/1.0" 200 76
207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/back.gif HTTP/1.0" 200 224
207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/women.html HTTP/1.0" 200 17534
208.196.124.26 - Dbuser [15/Oct/2006:18:46:55 -0700] "GET /candatop.html HTTP/1.0" 200 -
208.196.124.26 - - [15/Oct/2006:18:46:57 -0700] "GET /images/done.gif HTTP/1.0" 200 4785
www.att.com - - [15/Oct/2006:18:47:01 -0700] "GET /images/reddash2.gif HTTP/1.0" 200 237
208.196.124.26 - - [15/Oct/2006:18:47:02 -0700] "POST /images/refrun1.gif HTTP/1.0" 200 836
208.196.124.26 - - [15/Oct/2006:18:47:05 -0700] "GET /images/hasene2.gif HTTP/1.0" 200 8833
www.cnn.com - - [15/Oct/2006:18:47:08 -0700] "GET /images/candalog.gif HTTP/1.0" 200 -
208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/nigpost1.gif HTTP/1.0" 200 4429
208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/rally4.jpg HTTP/1.0" 200 7352
128.200.68.71 - - [15/Oct/2006:18:47:11 -0700] "GET /amnesty/usalinks.html HTTP/1.0" 143 10329
208.196.124.26 - - [15/Oct/2006:18:47:11 -0700] "GET /images/reyes.gif HTTP/1.0" 200 10859
6 Haskell HI Files
00000000 0001 face 0000 0073 0400 0000 3600 0000 .......s....6...
00000010 3000 0000 3500 0000 3000 0000 0000 0000 0...5...0.......
00000020 0001 0000 0000 0100 0000 0043 0001 0000 ...........C....
00000030 0002 0200 0000 0200 0000 0300 0000 0200 ................
00000040 0000 0400 0000 4800 0100 0000 0200 0000 ......H.........
00000050 0502 0000 0000 0006 0000 0000 0007 0000 ................
00000060 0001 0000 0000 6800 0000 0000 006f 0000 ......h......o..
00000070 0000 0100 0000 0800 0000 0968 6173 6b65 ...........haske
00000080 6c6c 3938 0000 0007 4350 5554 696d 6500 ll98....CPUTime.
00000090 0000 0462 6173 6500 0000 0847 4843 2e42 ...base....GHC.B
000000a0 6173 6500 0000 0e47 4843 2e46 6f72 6569 ase....GHC.Forei
000000b0 676e 5074 7200 0000 0e53 7973 7465 6d2e gnPtr....System.
000000c0 4350 5554 696d 6500 0000 0a67 6574 4350 CPUTime....getCP
000000d0 5554 696d 6500 0000 1063 7075 5469 6d65 UTime....cpuTime
000000e0 5072 6563 6973 696f 6e                  Precision
7 Ad Hoc Data from AT&T
8 And Many Others...
- Gene ontology data
- Cosmology data
- Financial trading data
- Telecom billing data
- Router config files
- System logs
- Call detail data
- Netflow packets
- DNS packets
- Java JAR files
- Jazz recording info
- ...
9 Why a data description language?
- Ad hoc data is difficult to manage
  - Data arrives "as is" in a wide variety of encodings and formats.
  - Documentation is out of date or non-existent.
  - Data is buggy and potentially malicious.
  - Processing must detect errors and respond in application-specific ways.
  - Data sources often have high volume.
- Existing solutions are insufficient
  - Lex/Yacc-like technologies target language syntax, rather than data.
  - Hand-coded C/Perl programs are time-consuming to produce, brittle with respect to changes, and fail to handle errors well.
- Data description languages (DDLs) address these issues
  - Data expert writes a declarative description rather than a parser.
  - Description serves as living documentation.
  - Parser exhaustively detects errors without cluttering user code.
  - Parser can be proven correct with respect to its handling of buggy data.
  - From the declarative specification, the compiler can generate auxiliary tools.
Data description languages facilitate managing ad
hoc data.
10 The PADS/C Data Description Language
- Provides rich and extensible set of base types for describing atomic data.
  - Pint8, Puint8, ... // -123, 44
  - Pstring() // hello
  - Pstring_FW(3) // catdog
  - Pstring_ME(/a/) // aaaaaab
  - Pdate, Ptime, Pip, ...
- Provides type constructors to describe structured data, by analogy with C
  - Pstruct, Parray, Punion, Ptypedef, Penum
- Allows arbitrary predicates to describe expected properties.
- Compiler generates parser, printer, and other useful tools in a type-directed fashion.
In the PADS/C DDL, each piece of data is
described by a type, which specifies the physical
format and semantic constraints of the data.
PADS uses a type metaphor to declaratively
describe ad hoc data.
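The type metaphor can be illustrated outside PADS/C. The following Python sketch is not PADS/C syntax and not what the PADS compiler emits; `p_int8`, `p_string_fw`, `p_lit`, and `p_struct` are hypothetical names that mimic two base types, a literal separator, and a struct constructor with a predicate.

```python
from typing import Any, Callable, Optional, Tuple

# A "type" here is a function (text, position) -> (value, new position),
# or None when the data does not match. All names are illustrative only.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]

def p_int8(text: str, pos: int):
    """Base type: signed decimal integer that must fit in 8 bits."""
    end = pos
    if end < len(text) and text[end] in "+-":
        end += 1
    first_digit = end
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == first_digit:
        return None                      # no digits at all
    value = int(text[pos:end])
    return (value, end) if -128 <= value <= 127 else None

def p_string_fw(width: int) -> Parser:
    """Base type: fixed-width string of exactly `width` characters."""
    def parse(text, pos):
        if pos + width > len(text):
            return None
        return text[pos:pos + width], pos + width
    return parse

def p_lit(lit: str) -> Parser:
    """Literal separator: consumed, but carries no value."""
    def parse(text, pos):
        return (None, pos + len(lit)) if text.startswith(lit, pos) else None
    return parse

def p_struct(fields, predicate=lambda rep: True) -> Parser:
    """Type constructor: a sequence of fields plus a semantic constraint."""
    def parse(text, pos):
        rep = {}
        for name, field_parser in fields:
            result = field_parser(text, pos)
            if result is None:
                return None
            value, pos = result
            if name is not None:         # unnamed fields are separators
                rep[name] = value
        return (rep, pos) if predicate(rep) else None
    return parse

# Hypothetical record: a 3-character tag, a '|' separator, a small count.
entry = p_struct(
    [("tag", p_string_fw(3)), (None, p_lit("|")), ("count", p_int8)],
    predicate=lambda rep: rep["count"] >= 0,
)

print(entry("cat|44", 0))    # ({'tag': 'cat', 'count': 44}, 6)
print(entry("dog|-123", 0))  # None: well-formed, but the predicate fails
```

The description (`entry`) is a value built from smaller descriptions, which is what lets one definition stand for both the physical layout and the constraint.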
11 Common Log Format in PADS/C
A complete PADS/C description of the web server
log data shown in the box
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
PADS allows concise, precise, and intuitive data
specifications.
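The PADS/C description itself is not reproduced in this transcript. As a rough stand-in, the Python sketch below describes one Common Log Format record as a table of fields and derives a parser from that table. It is not the slide's PADS/C text: the field names, the regular expressions, and the status-range check are all assumptions, and the sample record is written with the standard bracketed, colon-separated timestamp.

```python
import re

# Each entry pairs a (hypothetical) field name with a regular expression
# for its physical format. Fields are separated by a single space.
CLF_FIELDS = [
    ("client",   r"\S+"),          # IP address or host name
    ("remoteid", r"\S+"),          # "-" when unavailable
    ("auth",     r"\S+"),          # "-" when unauthenticated
    ("time",     r"\[[^\]]+\]"),   # [dd/Mon/yyyy:hh:mm:ss zone]
    ("request",  r'"[^"]*"'),      # "METHOD path VERSION"
    ("status",   r"\d{3}"),        # HTTP response code
    ("length",   r"\d+|-"),        # bytes sent, or "-" if unknown
]

# Compile the table into one pattern with a named group per field.
CLF_PATTERN = re.compile(
    " ".join(f"(?P<{name}>{regex})" for name, regex in CLF_FIELDS) + "$"
)

def parse_clf(line: str):
    """Return a dict of fields, or None if the line does not fit."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    rep = m.groupdict()
    rep["status"] = int(rep["status"])
    # Semantic constraint, in the spirit of a predicate on a field.
    if not 100 <= rep["status"] <= 599:
        return None
    rep["length"] = None if rep["length"] == "-" else int(rep["length"])
    return rep

record = ('207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] '
          '"GET /turkey/amnty1.gif HTTP/1.0" 200 3013')
rep = parse_clf(record)
print(rep["client"], rep["status"], rep["length"])  # 207.136.97.50 200 3013
```

Unlike this sketch, which simply rejects a bad line, the generated PADS parser reports where and how a record deviates, as the next slide describes.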
12 PADS Parsing and Printing
- From a data description, the PADS compiler generates
  - a parser, which maps raw input data and a mask to a pair of an in-memory representation and a parse descriptor.
  - a parse descriptor, which records meta-data about a parse, including location and error information.
  - a mask, which allows dynamic customization of parser behavior.
- PADS has a formal semantics, so we can prove formal properties about the generated parsers, such as
  - If the mask specifies "check all properties" and "set all representations", and the parse descriptor indicates no errors, then the in-memory representation is correct.
  - Malicious data cannot corrupt the parser.
- The PADS compiler also generates a printer, which maps an in-memory rep and a parse descriptor back to raw form. We'd like printing and parsing to be inverses, but that is a hard problem in general.
PADS uses meta-data to manage buggy or malicious
data.
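The interplay of mask, representation, and parse descriptor can be sketched in a few lines of Python. This is an illustration of the idea only, not the generated PADS/C interface: `Mask`, `ParseDescriptor`, `parse_status`, and `print_status` are hypothetical names, and keeping the raw text inside the descriptor is an assumption made so the printer can reproduce bad input.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Mask:
    check: bool = True     # verify semantic constraints?
    set_rep: bool = True   # fill in the in-memory representation?

@dataclass
class ParseDescriptor:
    nerr: int = 0
    errors: List[Tuple[int, str]] = field(default_factory=list)  # (offset, message)
    raw: str = ""          # original text (an assumption of this sketch)

def parse_status(raw: str, mask: Mask, offset: int = 0):
    """Parse a 3-digit status code. Bad input is recorded, never raised."""
    pd = ParseDescriptor(raw=raw)
    rep: Optional[int] = None
    if re.fullmatch(r"[0-9]{3}", raw) is None:
        pd.nerr += 1
        pd.errors.append((offset, "syntax: expected three digits"))
        return rep, pd
    value = int(raw)
    if mask.check and not 100 <= value <= 599:
        pd.nerr += 1
        pd.errors.append((offset, "semantic: status out of range"))
    if mask.set_rep:
        rep = value
    return rep, pd

def print_status(rep: Optional[int], pd: ParseDescriptor) -> str:
    """Printer: use the rep for a clean parse, else fall back to the raw text."""
    return str(rep) if pd.nerr == 0 and rep is not None else pd.raw

rep, pd = parse_status("2x0", Mask())
print(rep, pd.nerr, pd.errors)   # None 1 [(0, 'syntax: expected three digits')]
print(print_status(rep, pd))     # 2x0
```

With `Mask(check=False)` the same parser skips the range check, which is the kind of per-call customization the mask is meant to provide.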
13 Leverage!
- Given a data description, the computer essentially understands the data. We can leverage that understanding to generate many tools beyond a parser.
Type directed programming provides this
leverage. For each base type, we have to specify
the desired behavior. The compiler then lifts the
behavior to all structured types.
Type-directed programming allows generation of
useful tools from descriptions.
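A minimal sketch of this lifting, in Python rather than in the PADS compiler: the tool author supplies behavior for each base type only, and one generic function extends it to structs and arrays. The classes `Base`, `Struct`, and `Array`, the table `BASE_TO_TEXT`, and the `elt` tag are hypothetical; only the base type names come from the talk.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple
from xml.sax.saxutils import escape

@dataclass
class Base:
    name: str                        # e.g. "Pip", "Pstring" (lookup key only)

@dataclass
class Struct:
    fields: List[Tuple[str, Any]]    # (field name, description)

@dataclass
class Array:
    elem: Any                        # description of each element

# The only per-type work: how to render each base type as XML text.
BASE_TO_TEXT: Dict[str, Callable[[Any], str]] = {
    "Pip": str,
    "Pstring": escape,
}

def to_xml(desc: Any, rep: Any, tag: str) -> str:
    """Lift the base-type behavior to every structured type."""
    if isinstance(desc, Base):
        return f"<{tag}>{BASE_TO_TEXT[desc.name](rep)}</{tag}>"
    if isinstance(desc, Struct):
        inner = "".join(to_xml(d, rep[name], name) for name, d in desc.fields)
        return f"<{tag}>{inner}</{tag}>"
    if isinstance(desc, Array):
        inner = "".join(to_xml(desc.elem, item, "elt") for item in rep)
        return f"<{tag}>{inner}</{tag}>"
    raise TypeError(f"unknown description: {desc!r}")

entry = Struct([("client", Base("Pip")), ("words", Array(Base("Pstring")))])
rep = {"client": "207.136.97.49", "words": ["GET", "a<b"]}
print(to_xml(entry, rep, "entry"))
# <entry><client>207.136.97.49</client><words><elt>GET</elt><elt>a&lt;b</elt></words></entry>
```

Adding another tool, such as a statistics accumulator, would mean writing another small base-type table and another traversal of the same description, with no changes to the description itself.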
14 Learning: Goals and Approach
[Figure: raw data (email, ASCII log files, binary traces) is turned into a data description (struct ...), from which end-user tools are produced (visual information, CSV, XML, standard formats and schema).]
Problem: Producing useful tools for ad hoc data takes a lot of time.
Solution: A learning system to generate data descriptions and tools automatically.
15 Format Inference Overview
[Figure: format inference pipeline. Stages: Raw Data, Chunking Process, Tokenization, Structure Discovery, Format Refinement (with a Scoring Function), IR to PADS Printer, PADS Description, PADS Compiler. Generated tools shown: XMLifier (XML) and Accumulator (Analysis Report).]
16 Possible Additional Material
- PADS in More Depth: The language, the tools, the semantics. PLDI 05, POPL 06, POPL 07, PADL 08 (long talk).
- Format Inference: Basic algorithm, small demo, and experimental evaluation. POPL 08 (long talk).
- In Progress (short talk)
  - Improving format inference by learning tokenizations. PADL 09
  - Taking steps towards making inference incremental.
- Learning Demo: Perhaps better offline.
17 Contributors
- AT&T: Yitzhak Mandelbaum, Mary Fernandez, and Andrew Forest
- Princeton: David Walker, Kenny Zhu, Qian Xi
- Galois: Peter White and David Burke
- Penn: Nate Foster and Michael Greenberg
18 Motivation: Token Ambiguity Problem (TAP)
- Given a string, there are multiple ways to tokenize it.
- Example 1: 127.0.0.1
  - IP
  - Float Dot Float
  - Int Dot Int Dot Int Dot Int
- Example 2
  - Message
  - Word White Word White Word White... White URL
  - Word White Quote Filepath Quote White Word White...
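Example 1 can be reproduced with a short Python sketch that enumerates the readings of a string under a small token set. The token names follow the slide, but the regular expressions are assumptions and the real learnPADS token set is larger. Each token is matched greedily, so only tokenizations into maximal tokens are listed.

```python
import re
from typing import Iterator, List

# Hypothetical token definitions, chosen to mirror Example 1.
TOKENS = [
    ("IP",    re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("Float", re.compile(r"\d+\.\d+")),
    ("Int",   re.compile(r"\d+")),
    ("Dot",   re.compile(r"\.")),
]

def tokenizations(text: str, pos: int = 0) -> Iterator[List[str]]:
    """Yield every sequence of token names that covers text[pos:]."""
    if pos == len(text):
        yield []
        return
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m:
            # Commit to this token, then tokenize whatever is left.
            for rest in tokenizations(text, m.end()):
                yield [name] + rest

readings = list(tokenizations("127.0.0.1"))
for reading in readings:
    print(" ".join(reading))
print(len(readings))  # 6
```

The six readings include the three on the slide (IP; Float Dot Float; Int Dot Int Dot Int Dot Int) plus mixed ones such as Float Dot Int Dot Int, which shows how quickly ambiguity grows even for a nine-character string.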
19 How does learnPADS deal with TAP?
- Tokenization Phase
  - Take the first, longest match.
  - A fixed order is assigned by the end user.
  - We have no order to pick.
[Figure labels: Float, Int, ID, Path]
As a result, the current learning system
- can't have ambiguous base tokens (Message, Text, ID).
- sometimes produces descriptions that are too precise.
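One plausible reading of "first, longest match" is sketched below in Python: at each position every token is tried, the longest match wins, and ties go to the token that comes first in a fixed order. The token set, its order, and the `Error` fallback are assumptions for illustration, not learnPADS's actual configuration.

```python
import re
from typing import List, Tuple

# Fixed priority order, as if assigned by the end user (hypothetical).
ORDERED_TOKENS = [
    ("IP",    re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("Float", re.compile(r"\d+\.\d+")),
    ("Int",   re.compile(r"\d+")),
    ("Dot",   re.compile(r"\.")),
    ("White", re.compile(r"\s+")),
    ("Word",  re.compile(r"[A-Za-z]+")),
]

def tokenize(text: str) -> List[Tuple[str, str]]:
    """Commit to a single tokenization: longest match, earliest token on ties."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, pattern in ORDERED_TOKENS:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier token when lengths are equal.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            best = ("Error", pos + 1)   # skip one unrecognized character
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out

print(tokenize("127.0.0.1"))  # [('IP', '127.0.0.1')]
print(tokenize("3.14.15"))    # [('Float', '3.14'), ('Dot', '.'), ('Int', '15')]
```

Because the tokenizer commits to one reading per position, every other reading is discarded before structure discovery runs, which is consistent with the slide's remark that the resulting descriptions can be too precise.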
20 Scaling to Larger Data Sets
- Original algorithm keeps entire data set in memory, so won't scale to large data sets.
- Proposed conceptual architecture to permit incremental learning