Kathleen Fisher - PowerPoint PPT Presentation

Description:

He is finishing his Post Doc at Princeton and looking for jobs, ... Regulus data: Monitor IP network. ASCII. 15 sources, ~15 GB/day. Netflow: Monitor IP network ...

1
PADS: A System for Managing Ad Hoc Data
  • Kathleen Fisher
  • AT&T Labs Research
  • www.padsproj.org

And many many others
2
Kenny Zhu
  • Dr. Zhu has been one of the main contributors to
    the PADS project.
  • He is finishing his Post Doc at Princeton and
    looking for jobs, both in North America and Asia.

http://www.cs.princeton.edu/kzhu/
3
Data, Data, Everywhere!
Incredible amounts of data stored in well-behaved
formats
Databases
Tools
  • Schema
  • Browsers
  • Query Languages
  • Standards
  • Libraries
  • Books, documentation
  • Training courses
  • Conversion tools
  • Vendor support
  • Consultants...

XML
4
We're not always so lucky!
Vast amounts of chaotic ad hoc data
Tools
  • Perl
  • Awk
  • C
  • ...

5
Web Logs
207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/clear.gif HTTP/1.0" 200 76
207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/back.gif HTTP/1.0" 200 224
207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/women.html HTTP/1.0" 200 17534
208.196.124.26 - Dbuser [15/Oct/2006:18:46:55 -0700] "GET /candatop.html HTTP/1.0" 200 -
208.196.124.26 - - [15/Oct/2006:18:46:57 -0700] "GET /images/done.gif HTTP/1.0" 200 4785
www.att.com - - [15/Oct/2006:18:47:01 -0700] "GET /images/reddash2.gif HTTP/1.0" 200 237
208.196.124.26 - - [15/Oct/2006:18:47:02 -0700] "POST /images/refrun1.gif HTTP/1.0" 200 836
208.196.124.26 - - [15/Oct/2006:18:47:05 -0700] "GET /images/hasene2.gif HTTP/1.0" 200 8833
www.cnn.com - - [15/Oct/2006:18:47:08 -0700] "GET /images/candalog.gif HTTP/1.0" 200 -
208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/nigpost1.gif HTTP/1.0" 200 4429
208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/rally4.jpg HTTP/1.0" 200 7352
128.200.68.71 - - [15/Oct/2006:18:47:11 -0700] "GET /amnesty/usalinks.html HTTP/1.0" 143 10329
208.196.124.26 - - [15/Oct/2006:18:47:11 -0700] "GET /images/reyes.gif HTTP/1.0" 200 10859
6
Haskell HI Files
00000000  0001 face 0000 0073 0400 0000 3600 0000  .......s....6...
00000010  3000 0000 3500 0000 3000 0000 0000 0000  0...5...0.......
00000020  0001 0000 0000 0100 0000 0043 0001 0000  ...........C....
00000030  0002 0200 0000 0200 0000 0300 0000 0200  ................
00000040  0000 0400 0000 4800 0100 0000 0200 0000  ......H.........
00000050  0502 0000 0000 0006 0000 0000 0007 0000  ................
00000060  0001 0000 0000 6800 0000 0000 006f 0000  ......h......o..
00000070  0000 0100 0000 0800 0000 0968 6173 6b65  ...........haske
00000080  6c6c 3938 0000 0007 4350 5554 696d 6500  ll98....CPUTime.
00000090  0000 0462 6173 6500 0000 0847 4843 2e42  ...base....GHC.B
000000a0  6173 6500 0000 0e47 4843 2e46 6f72 6569  ase....GHC.Forei
000000b0  676e 5074 7200 0000 0e53 7973 7465 6d2e  gnPtr....System.
000000c0  4350 5554 696d 6500 0000 0a67 6574 4350  CPUTime....getCP
000000d0  5554 696d 6500 0000 1063 7075 5469 6d65  UTime....cpuTime
000000e0  5072 6563 6973 696f 6e                   Precision
7
Ad Hoc Data from AT&T
8
And Many Others...
  • Gene ontology data
  • Cosmology data
  • Financial trading data
  • Telecom billing data
  • Router config files
  • System logs
  • Call detail data
  • Netflow packets
  • DNS packets
  • Java JAR files
  • Jazz recording info
  • ...

9
Why a data description language?
  • Ad hoc data is difficult to manage.
      • Data arrives "as is" in a wide variety of
        encodings and formats.
      • Documentation is out of date or non-existent.
      • Data is buggy and potentially malicious.
      • Processing must detect errors and respond in
        application-specific ways.
      • Data sources often have high volume.
  • Existing solutions are insufficient.
      • Lex/Yacc-like technologies target language
        syntax, rather than data.
      • Hand-coded C/Perl programs are time-consuming to
        produce, brittle with respect to changes, and
        fail to handle errors well.
  • Data description languages (DDLs) address these
    issues.
      • Data expert writes declarative description rather
        than a parser.
      • Description serves as living documentation.
      • Parser exhaustively detects errors without
        cluttering user code.
      • Parser can be proven correct with respect to its
        handling of buggy data.
      • From declarative specification, compiler can
        generate auxiliary tools.

Data description languages facilitate managing ad
hoc data.
10
The PADS/C Data Description Language
  • Provides a rich and extensible set of base types
    for describing atomic data.
      • Pint8, Puint8,       // -123, 44
      • Pstring()            // hello
      • Pstring_FW(3)        // catdog
      • Pstring_ME(/a/)      // aaaaaab
      • Pdate, Ptime, Pip,
  • Provides type constructors to describe structured
    data, by analogy with C.
      • Pstruct, Parray, Punion, Ptypedef, Penum
  • Allows arbitrary predicates to describe expected
    properties.
  • Compiler generates parser, printer, and other
    useful tools in a type-directed fashion.

In the PADS/C DDL, each piece of data is
described by a type, which specifies the physical
format and semantic constraints of the data.

PADS uses a type metaphor to declaratively
describe ad hoc data.
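
To make the type metaphor concrete, here is a minimal sketch in Python (not PADS/C, and not the PADS runtime). The names p_int8, p_string_fw, p_literal, p_where, and p_struct are hypothetical stand-ins for Pint8, Pstring_FW, a literal separator, a predicate, and Pstruct. Each parser is assumed to take the input text and an offset and to return a value, the new offset, and an error message (or None).

```python
import re
from typing import Any, Callable, List, Optional, Tuple

# A parser maps (text, pos) to (value, new_pos, error message or None).
Parser = Callable[[str, int], Tuple[Any, int, Optional[str]]]

_INT = re.compile(r"[+-]?[0-9]+")

def p_int8(text: str, pos: int) -> Tuple[Any, int, Optional[str]]:
    """ASCII signed 8-bit integer, e.g. -123 (stand-in for Pint8)."""
    m = _INT.match(text, pos)
    if m is None:
        return None, pos, "expected an integer"
    value = int(m.group())
    err = None if -128 <= value <= 127 else "does not fit in 8 bits"
    return value, m.end(), err

def p_string_fw(width: int) -> Parser:
    """Fixed-width string (stand-in for Pstring_FW)."""
    def parse(text: str, pos: int):
        chunk = text[pos:pos + width]
        err = None if len(chunk) == width else "input too short"
        return chunk, pos + len(chunk), err
    return parse

def p_literal(lit: str) -> Parser:
    """Literal separator that must appear in the input."""
    def parse(text: str, pos: int):
        if text.startswith(lit, pos):
            return lit, pos + len(lit), None
        return None, pos, f"expected {lit!r}"
    return parse

def p_where(parser: Parser, predicate: Callable[[Any], bool], message: str) -> Parser:
    """Attach a semantic predicate to an existing parser."""
    def parse(text: str, pos: int):
        value, new_pos, err = parser(text, pos)
        if err is None and not predicate(value):
            err = message
        return value, new_pos, err
    return parse

def p_struct(fields: List[Tuple[Optional[str], Parser]]) -> Parser:
    """Sequence of fields (stand-in for Pstruct); unnamed fields are not stored."""
    def parse(text: str, pos: int):
        rep, errors = {}, []
        for name, field_parser in fields:
            value, pos, err = field_parser(text, pos)
            if name is not None:
                rep[name] = value
            if err is not None:
                errors.append(f"{name or '<literal>'}: {err}")
        return rep, pos, "; ".join(errors) or None
    return parse

# A three-letter tag, a '|' separator, and a negative 8-bit integer.
record = p_struct([
    ("tag", p_string_fw(3)),
    (None, p_literal("|")),
    ("delta", p_where(p_int8, lambda v: v < 0, "expected a negative value")),
])
```

Calling record("cat|-123", 0) yields the representation {"tag": "cat", "delta": -123} with no error, while record("dog|44", 0) still returns a representation but reports that the predicate on delta failed.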
11
Common Log Format in PADS/C
A complete PADS/C description of the web server
log data shown in the box:
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

PADS allows concise, precise, and intuitive data
specifications.
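
The PADS/C description itself does not survive in this transcript. As a rough illustration of what such a description has to capture, the following Python sketch (not PADS/C) names the fields of one Common Log Format record and checks one semantic constraint. The field names, the Entry class, and parse_entry are assumptions made for this example, and the timestamp is assumed to use the standard bracketed form.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One Common Log Format record: host, two optional ids, timestamp,
# request line, response code, and content length ("-" if unknown).
CLF = re.compile(
    r'(?P<host>\S+) (?P<remote_id>\S+) (?P<auth_id>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/(?P<version>[0-9]+\.[0-9]+)" '
    r'(?P<status>[0-9]{3}) (?P<length>[0-9]+|-)$'
)

@dataclass
class Entry:
    host: str
    remote_id: Optional[str]
    auth_id: Optional[str]
    date: str
    method: str
    uri: str
    version: str
    status: int
    length: Optional[int]

def parse_entry(line: str) -> Entry:
    """Parse one record; raise ValueError on syntax or constraint errors."""
    m = CLF.match(line)
    if m is None:
        raise ValueError("line does not match Common Log Format")
    g = m.groupdict()
    status = int(g["status"])
    # Semantic constraint: response codes lie in 100..599.
    if not 100 <= status < 600:
        raise ValueError(f"implausible response code {status}")

    def dash(s: str) -> Optional[str]:
        # "-" marks a missing value.
        return None if s == "-" else s

    length = dash(g["length"])
    return Entry(
        host=g["host"],
        remote_id=dash(g["remote_id"]),
        auth_id=dash(g["auth_id"]),
        date=g["date"],
        method=g["method"],
        uri=g["uri"],
        version=g["version"],
        status=status,
        length=None if length is None else int(length),
    )
```

Unlike a PADS-generated parser, this sketch stops at the first error instead of recording it and continuing.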
12
PADS Parsing and Printing
  • From a data description, the PADS compiler
    generates:
      • a parser, which maps raw input data and a mask to
        a pair of an in-memory representation and a parse
        descriptor.
      • a parse descriptor, which records meta-data about
        a parse, including location and error
        information.
      • a mask, which allows dynamic customization of
        parser behavior.
  • PADS has a formal semantics, so we can prove
    formal properties about the generated parsers,
    such as:
      • If the mask specifies "check all properties
        and set all representations," and the parse
        descriptor indicates no errors, then the
        in-memory representation is correct.
      • Malicious data cannot corrupt the parser.
  • The PADS compiler also generates a printer, which
    maps an in-memory rep and a parse descriptor back
    to raw form. We'd like printing and parsing to be
    inverses, but that is a hard problem in general.

PADS uses meta-data to manage buggy or malicious
data.
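
The relationship between mask, representation, and parse descriptor can be sketched in a few lines of Python. This is a toy for a single field (a decimal port number), not the generated PADS/C code; Mask, ParseDesc, parse_port, and print_port are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass, field
from enum import Flag, auto
from typing import List, Optional, Tuple

class Mask(Flag):
    """Controls what the parser does beyond recognizing syntax."""
    NONE = 0
    SET = auto()      # fill in the in-memory representation
    CHECK = auto()    # check semantic constraints
    CHECK_AND_SET = SET | CHECK

@dataclass
class ParseDesc:
    """Meta-data about one parse: error count plus (offset, message) pairs."""
    nerr: int = 0
    errors: List[Tuple[int, str]] = field(default_factory=list)

    def report(self, offset: int, message: str) -> None:
        self.nerr += 1
        self.errors.append((offset, message))

def parse_port(text: str, mask: Mask) -> Tuple[Optional[int], ParseDesc]:
    """Parse a decimal port number with the constraint port <= 65535."""
    pd = ParseDesc()
    if re.fullmatch(r"[0-9]+", text) is None:
        pd.report(0, "syntax: expected decimal digits")
        return None, pd
    value = int(text)
    if Mask.CHECK in mask and value > 65535:
        pd.report(0, "constraint: port out of range")
    rep = value if Mask.SET in mask else None
    return rep, pd

def print_port(rep: Optional[int], pd: ParseDesc) -> str:
    """Map a representation back to raw text (pd is unused in this toy)."""
    if rep is None:
        raise ValueError("no representation to print")
    return str(rep)
```

With Mask.CHECK_AND_SET, parse_port("70000", ...) still returns a representation but records one error in the parse descriptor, so the caller decides how to respond. The toy also shows why printing and parsing are not exact inverses here: "0080" parses to 80, which prints as "80".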
13
Leverage!
  • Given a data description, the computer
    essentially understands the data. We can leverage
    that understanding to generate many tools beyond
    a parser.

Type-directed programming provides this
leverage. For each base type, we have to specify
the desired behavior. The compiler then lifts the
behavior to all structured types.

Type-directed programming allows generation of
useful tools from descriptions.
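
A small Python sketch of the lifting idea, under stated assumptions: the description is modeled as a tree of hypothetical Base, Struct, and Array nodes, the per-base-type behavior is a table of text converters, and one recursive function lifts that behavior to structured types to produce XML. None of these names come from PADS itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union
from xml.sax.saxutils import escape

@dataclass(frozen=True)
class Base:
    name: str                                  # e.g. "Pstring", "Puint32"

@dataclass(frozen=True)
class Struct:
    fields: Tuple[Tuple[str, "Desc"], ...]     # (field name, field description)

@dataclass(frozen=True)
class Array:
    elem: "Desc"

Desc = Union[Base, Struct, Array]

# Behavior specified once per base type: how to render a value as text.
BASE_TO_TEXT: Dict[str, Callable[[Any], str]] = {
    "Pstring": str,
    "Puint32": lambda n: str(int(n)),
}

def to_xml(desc: Desc, rep: Any, tag: str) -> str:
    """Lift the base-type behavior to any structured description."""
    if isinstance(desc, Base):
        return f"<{tag}>{escape(BASE_TO_TEXT[desc.name](rep))}</{tag}>"
    if isinstance(desc, Struct):
        inner = "".join(to_xml(d, rep[name], name) for name, d in desc.fields)
        return f"<{tag}>{inner}</{tag}>"
    if isinstance(desc, Array):
        inner = "".join(to_xml(desc.elem, item, "elt") for item in rep)
        return f"<{tag}>{inner}</{tag}>"
    raise TypeError(f"unknown description node: {desc!r}")

# A record with a host name and an array of sizes.
entry_desc = Struct((("host", Base("Pstring")), ("sizes", Array(Base("Puint32")))))
```

Adding another tool (say, a statistics accumulator) would mean writing one more table of base-type behaviors and one more recursive function; the descriptions stay unchanged.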
14
Learning: Goals and Approach
[Figure labels: Raw Data (Email, ASCII log files, Binary Traces); Data Description (struct ...); End-user tools; Visual Information; Standard formats and schema (CSV, XML)]
Problem: Producing useful tools for ad hoc data
takes a lot of time.
Solution: A learning system to generate data
descriptions and tools automatically.
15
Format Inference Overview
[Figure labels: Raw Data; Chunking Process; Tokenization; Structure Discovery; Format Refinement; Scoring Function; IR to PADS Printer; PADS Description; PADS Compiler; Accumulator; Analysis Report; XMLifier; XML]
16
Possible Additional Material
  • PADS in More Depth: The language, the tools, the
    semantics. PLDI 05, POPL 06, POPL 07, PADL 08
    (long talk).
  • Format Inference: Basic algorithm, small demo,
    and experimental evaluation. POPL 08 (long talk).
  • In Progress (short talk)
      • Improving format inference by learning
        tokenizations. PADL 09
      • Taking steps towards making inference
        incremental.
  • Learning Demo: Perhaps better offline.

17
Contributors
  • AT&T: Yitzhak Mandelbaum, Mary Fernandez, and
    Andrew Forest
  • Princeton: David Walker, Kenny Zhu, Qian Xi
  • Galois: Peter White and David Burke
  • Penn: Nate Foster and Michael Greenberg

18
Motivation: Token Ambiguity Problem (TAP)
  • Given a string, there are multiple ways to
    tokenize it.
  • Example 1: 127.0.0.1
      • IP
      • Float Dot Float
      • Int Dot Int Dot Int Dot Int
  • Example 2
      • Message
      • Word White Word White Word White... White URL
      • Word White Quote Filepath Quote White Word
        White...
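
The ambiguity in Example 1 can be reproduced with a short Python sketch. The four token definitions below are assumptions made for the illustration (they are not the learnPADS token set); the function tries every token type at every position and collects each complete tokenization.

```python
import re
from typing import List

# Hypothetical base tokens; each regex is greedy, so a token type
# contributes its longest match at a given position.
TOKENS = [
    ("IP", re.compile(r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}")),
    ("Float", re.compile(r"[0-9]+\.[0-9]+")),
    ("Int", re.compile(r"[0-9]+")),
    ("Dot", re.compile(r"\.")),
]

def tokenizations(text: str, pos: int = 0) -> List[List[str]]:
    """Return every sequence of token names that covers text[pos:]."""
    if pos == len(text):
        return [[]]
    results = []
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m is not None:
            # Every pattern consumes at least one character, so this terminates.
            for rest in tokenizations(text, m.end()):
                results.append([name] + rest)
    return results
```

For "127.0.0.1" this token set yields six tokenizations, including the three listed on the slide (IP; Float Dot Float; Int Dot Int Dot Int Dot Int).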

19
How does learnPADS deal with TAP?
  • Tokenization Phase
      • Take the first, longest match.
      • A fixed order is assigned by the end user.
      • We have no order to pick.

[Figure labels: Float, Int, ID, Path]
As a result, the current learning system
  • can't have ambiguous base tokens (Message, Text,
    ID).
  • sometimes produces descriptions that are too
    precise.
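
One reading of "first, longest match" is sketched below in Python: at each position the longest match wins, and ties go to the token type that comes first in a fixed, user-supplied order. The token names and regular expressions are assumptions for the illustration, not the actual learnPADS configuration.

```python
import re
from typing import List, Pattern, Tuple

Order = List[Tuple[str, Pattern[str]]]

def tokenize(text: str, order: Order) -> List[Tuple[str, str]]:
    """Longest match at each position; ties go to the earliest entry in order."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, pattern in order:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier token type when lengths are equal.
            # Empty matches are ignored so the loop always makes progress.
            if m is not None and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        out.append((best[0], best[1].group()))
        pos = best[1].end()
    return out

WORD = ("Word", re.compile(r"[A-Za-z]+"))
ID = ("ID", re.compile(r"[A-Za-z0-9]+"))
WHITE = ("White", re.compile(r"[ \t]+"))
```

With the order [WORD, ID, WHITE], the input "abc" is always a Word; with [ID, WORD, WHITE] it is always an ID. Either way the tokenizer commits to a single answer, which is how an ambiguity like the one on the previous slide gets hidden from the later inference phases.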
20
Scaling to Larger Data Sets
  • Original algorithm keeps entire data set in
    memory, so won't scale to large data sets.
  • Proposed conceptual architecture to permit
    incremental learning