Title: Ad Hoc Data: From Uggh to Smug
1Ad Hoc Data: From Uggh to Smug
- David Walker
- Princeton University
00000000 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000
- ?
2Ad Hoc Data is Everywhere
- Lots of data lives in databases, but even more data does not
- Ad hoc data: sets of semi-structured data files for which standard data processing tools are unavailable
- Tasks: getting the data into a database (and other kinds of transformations), data cleaning, querying, editing, parsing, ...
- Troubles: error-prone, limited documentation, evolving formats, huge volume, ...
Router Configs
Network Monitoring
Web Logs
Billing Info
Cosmology Data
3Two New Systems
- Anne: A Mark-up Language for Ad Hoc Data (PLDI 2010)
- with Qian Xi (Princeton)
- Forest: A Language for Specifying Environmental Assumptions
- with Kathleen Fisher (AT&T)
- Nate Foster (Princeton)
- Kenny Zhu (Shanghai Jiao Tong University)
4Anne: A Context-free Mark-up Language for Ad Hoc Data (PLDI 2010)
Qian Xi
5The Problem
- What is the fastest, most reliable way to go from data like this
- To a parse tree like this
- And generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...
- - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
...
6Our Solution: Anne
- Develop a mark-up language for ordinary text
- programmers annotate raw text using a set of grammatical directives
- a simple, predictable algorithm generates a complete grammar and processing tools from the directives and the surrounding raw data
- Pros
- really easy to use
- directives are simple and are applied when and where needed
- you can do it at 3am
- predictable
- documentation and tools may be generated automatically
- Cons
- not completely automatic
- but I'm skeptical that any more magical bullet exists anyway
7Document
- - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200
Generated Grammar
Edit document to add directives
Entry207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200
Generated Grammar
Entry ::= int . int . int . int word ... int int
Default tokenization of tagged data
Non-terminal name drawn from directive
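The default tokenization step can be illustrated with a small Python sketch. The token classes `int` and `word` come from the slide; the regular expressions, the `::=` notation, and the function names `default_tokens` and `rule_for` are assumptions made for illustration, not Anne's actual rules.

```python
import re

# Hypothetical token classes: a run of digits is an `int`, a run of letters
# is a `word`, and any other non-space character stands for itself.
TOKEN_RE = re.compile(
    r"(?P<int>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<punct>[^\sA-Za-z0-9])"
)

def default_tokens(text):
    """Map raw text to a list of grammar symbols."""
    symbols = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "punct":
            symbols.append(match.group())    # punctuation is kept literally
        else:
            symbols.append(match.lastgroup)  # "int" or "word"
    return symbols

def rule_for(name, tagged_text):
    """Build a one-line grammar rule for the directive called `name`."""
    return f"{name} ::= " + " ".join(default_tokens(tagged_text))

print(rule_for("Entry", "207.136.97.49"))
```

Tagging only the IP address with `Entry` yields the `int . int . int . int` prefix shown in the generated grammar.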
Second directive
Entry207.136.97.49 ID- "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar
New grammar rule
ID ::= -
Entry ::= int . int . int . int ID word ... int int
The default grammar now includes the new non-terminal
Multiple occurrences of the same name imply a union of grammars
Entry207.136.97.49 ID- "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org IDamnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar
union of grammars
ID ::= - | word
Entry ::= int . int . int . int ID word ... int int
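A minimal Python sketch of the union behaviour: every tagged occurrence of a name contributes one alternative to that non-terminal. The input representation (pairs of a name and a symbol list), the `|` notation for alternatives, and the function names are assumptions for illustration only.

```python
def infer_rules(tagged):
    """Collect alternatives per non-terminal.

    `tagged` is a list of (name, symbols) pairs in document order, where
    `symbols` is the token sequence found inside one directive.
    """
    rules = {}
    for name, symbols in tagged:
        alternatives = rules.setdefault(name, [])
        if symbols not in alternatives:  # identical occurrences add nothing
            alternatives.append(symbols)
    return rules

def show(rules):
    """Render each non-terminal as a single rule with `|` between alternatives."""
    return [
        f"{name} ::= " + " | ".join(" ".join(alt) for alt in alts)
        for name, alts in rules.items()
    ]

# The two ID directives from the example: one around "-", one around "amnesty".
print(show(infer_rules([("ID", ["-"]), ("ID", ["word"])])))
```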
denotes presence of constant string
Entry207.136.97.49 ID- GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org IDamnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar
ID ::= - | word
Entry ::= int . int . int . int ID GET ... int int
directs the system to infer a terminating symbol
a space follows the closing brace
EntryLoc207.136.97.49 ID- GET /turkey/amnty1.gif HTTP/1.0" 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org IDamnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar
Loc ::= (any string terminated by a space)
ID ::= - | word
Entry ::= Loc ID GET ... int int
13Interjection: The Config File
- A config file provides a mechanism for defining regular expressions and giving them names
- def is an internal definition
- exp is an exported named regular expression
- The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ...)
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
...
exp Time {db}:{db}:{db}\( {ampm}\)?\( \t{zone}\)?
exp IP ...
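The def/exp split can be mimicked in a few lines of Python: internal definitions are only building blocks, and exported definitions become named tokens. This is a sketch under stated assumptions: the `{name}` reference syntax, the `expand` helper, and the translation of the patterns into Python `re` syntax are all hypothetical and are not taken from the real config file format.

```python
import re

# Internal ("def") definitions, written here in Python regex syntax.
DEFS = {
    "db": r"[0-9][0-9]",
    "zone": r"[+-][0-1][0-9]00",
    "ampm": r"am|AM|pm|PM",
}

# Exported ("exp") definitions may refer to internal ones as {name}.
EXPORTS = {
    "Time": r"{db}:{db}:{db}( {ampm})?( \t{zone})?",
}

def expand(pattern, defs):
    """Replace each {name} reference by its definition, wrapped in a group."""
    def substitute(match):
        return "(?:" + expand(defs[match.group(1)], defs) + ")"
    return re.sub(r"\{([A-Za-z_]+)\}", substitute, pattern)

# Only exported names become tokens that directives can refer to.
TOKENS = {name: re.compile(expand(rx, DEFS)) for name, rx in EXPORTS.items()}

print(bool(TOKENS["Time"].fullmatch("18:46:51 pm")))
```

Wrapping each expanded definition in a non-capturing group keeps an alternation such as `ampm` from leaking into the surrounding pattern.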
pre-defined token
EntryIP207.136.97.49 ID- GET /turkey/amnty1.gi .... 200 3013
- - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
- - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org IDamnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar
Definition drawn from config file
IP ::= ... from config file ...
ID ::= - | word
Entry ::= IP ID GET ... int int
15XML Generation and Debugging
16Other Features
- Most features inspired by similar constructs found in PADS
- Enumerations
- Recursion (context-freedom)
- Kleene Star (with optional element definitions, separators, and terminators)
- Options
- Prioritized Unions
- Assertions
- Tables
- Generated Artifacts
- PADS description (and from there, the PADS tool suite)
- XML and CSS for debugging
- Semantics: connections to Relevance Logic, see ...
17Repetition (1)
Kleene Star with elements separated by a separator symbol and defined by the first element
Elem ::= int
Record ::= (Elem (<separator> Elem)*)?
Repetition (2)
Kleene Star with elements separated by a separator symbol and defined by Item
Item ::= int
Record ::= (Item (<separator> Item)*)?
18Optional Data
? denotes optional data
Item ::= int?
Record ::= (Item (<separator> Item)*)?
missing elements
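The shape of the repetition and optional-data rules can be sketched as a small Python recognizer. The separator character `|`, the function name `parse_record`, and the choice to represent a missing element as `None` are assumptions for illustration; the slides do not fix any of them.

```python
import re

INT = re.compile(r"-?[0-9]+")

def parse_record(text, sep="|", optional_items=False):
    """Recognize a separated list of ints and return it as a Python list.

    With optional_items=False each item must be an int; with
    optional_items=True an empty field is accepted as a missing element.
    An empty record is always allowed (the outer "?" in the rule).
    """
    if text == "":
        return []
    items = []
    for field in text.split(sep):
        if INT.fullmatch(field):
            items.append(int(field))
        elif field == "" and optional_items:
            items.append(None)  # missing element
        else:
            raise ValueError(f"not an int: {field!r}")
    return items

print(parse_record("1|2|3"))
print(parse_record("1||3", optional_items=True))
```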
Assertions and Context-Freedom
! claims the underlying data will satisfy the nonterminal
Parens ::= ( '(' Parens ')' )?
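Because the Parens rule refers to itself, it describes nested parentheses of any depth, which no regular expression can do. A minimal recursive-descent recognizer in Python shows the idea; the function names are hypothetical and this is not how Anne generates its parsers.

```python
def parse_parens(text, pos=0):
    """Recognize an optional '(' Parens ')' starting at `pos`.

    Returns the position just after the matched text, or raises ValueError
    when an opening parenthesis is never closed.
    """
    if pos < len(text) and text[pos] == "(":
        inner_end = parse_parens(text, pos + 1)  # the recursive occurrence
        if inner_end < len(text) and text[inner_end] == ")":
            return inner_end + 1
        raise ValueError(f"expected ')' at position {inner_end}")
    return pos  # the optional part is absent

def is_parens(text):
    """True when the whole string is derivable from Parens."""
    try:
        return parse_parens(text) == len(text)
    except ValueError:
        return False

print(is_parens("((()))"), is_parens("(()"))
```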
19Table (1)
EJason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...
Row ::= Word Word , \t int ...
Record ::= Row (NL Row)*
Table (2)
EhName GP Goals Assists Points +/-
Jason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...
Row ::= ...
Header ::= Name \t ...
Record ::= Header NL Row
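A rough Python sketch of what a parser derived from the table rules does: split the text into newline-separated rows, treat the first row as a header when one is declared, and read the remaining columns as ints. It assumes tab-separated columns (as the `\t` in the Row rule suggests); the function name `parse_table` and the returned shape are illustrative choices.

```python
def parse_table(text, has_header=False):
    """Parse a tab-separated table into (header, rows).

    Each row is (name, [int, ...]); the trailing comma after the name
    is treated as punctuation and dropped.
    """
    lines = text.split("\n")
    header = None
    if has_header:
        header = lines[0].split("\t")
        lines = lines[1:]
    rows = []
    for line in lines:
        name, *numbers = line.split("\t")
        rows.append((name.rstrip(","), [int(n) for n in numbers]))
    return header, rows

sample = (
    "Name\tGP\tGoals\tAssists\tPoints\t+/-\n"
    "Jason Blake,\t78\t25\t38\t63\t-2\n"
    "Alexei Ponikarovsky,\t82\t23\t38\t61\t6"
)
header, rows = parse_table(sample, has_header=True)
print(header, rows[0])
```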
20Forest: A Specification Language for Environmental Assumptions (work in progress!)
Kathleen Fisher
Nate Foster
Kenny Zhu
22PADS Web Site
24If only we could...
- Describe required file and directory structure, including permissions, etc.
- Check that the actual file system matches the spec.
- Eliminate a whole class of errors!
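To make the wish concrete, here is a minimal Python sketch of checking a directory against a declarative spec. The spec format (a dictionary from relative path to kind and optional permission bits), the example paths, and the `check` function are all hypothetical; this is not Forest, only an illustration of the kind of check being asked for.

```python
import stat
from pathlib import Path

# Hypothetical spec: relative path -> (kind, required permission bits or None).
SPEC = {
    "logs": ("dir", None),
    "logs/app.log": ("file", None),
    "config": ("file", 0o600),
}

def check(root, spec):
    """Return a list of violations; an empty list means the tree conforms."""
    problems = []
    for rel, (kind, mode) in spec.items():
        path = Path(root) / rel
        if kind == "dir" and not path.is_dir():
            problems.append(f"missing directory: {rel}")
        elif kind == "file" and not path.is_file():
            problems.append(f"missing file: {rel}")
        elif mode is not None and stat.S_IMODE(path.stat().st_mode) != mode:
            problems.append(f"wrong permissions on {rel}")
    return problems
```

Calling `check(some_root, SPEC)` before a tool runs turns a missing log directory or a world-readable config file into an explicit report instead of a failure deep inside the tool.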
25CORAL Monitoring System
- Monitoring system for an Internet-scale,
self-organizing, web-content distribution
network developed by Mike Freedman, Princeton.
26Observations on Monitoring
- Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
- Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
- File and directory names encode meta-data.
- Want to ask questions such as
- what was the total load on planetlab1 last week?
- on what days and at what times are files missing?
- what is the maximum memory usage?
- Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)
27Other Possible Examples
- Filesystem Hierarchy Standard (FHS) for Unix-like installations
- Haskell code base, PADS Source Tree
- source code, data, examples, executables, ...
- Cabal system for GHC libraries
- Disk cache for browser history, IMAP mail
- Scientific data sets
- CVS, SVN, other source control systems
28To Do!
- We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
- specify files
- directory structure
- dependencies (config files determine file system structure)
- meta-data (permissions, sizes, owners, modification times)
- The Plan
- Build such a specification language on top of PADS
- Generate a checker from the specifications
- Interface that allows programs to slurp up specified data from the file system
- Stand-alone tools: query engine, monitor, etc.
29Back to CORAL
30Example CORAL
ptype conf_t ... - pads description
ptype corald_t ... - pads description
ptype dns_t ... - pads description
ptype web_t ... - pads description
ptype probe_t ... - pads description
31Example CORAL
ptype conf_t ... - pads description
ptype corald_t ... - pads description
ptype dns_t ... - pads description
ptype web_t ... - pads description
ptype probe_t ... - pads description
ptype date_d(t:pdate) pdirectory
  corald is "corald.log" corald_t <| timestamp > t |>
  coraldns is "nssrv.log" dns_t <| timestamp > t |>
  coralweb is "websrv.log" web_t <| timestamp > t |>
  probe is "probed.log" probe_t <| timestamp > t |>
  time pdate t
32Example CORAL
ptype conf_t ... - pads description
ptype corald_t ... - pads description
ptype dns_t ... - pads description
ptype web_t ... - pads description
ptype probe_t ... - pads description
ptype date_d(t:pdate) pdirectory ... as before ...
ptype host_d pdirectory
  times is [t : date_d(t) | t <- pdate]
33Example CORAL
ptype conf_t ... - pads description
ptype corald_t ... - pads description
ptype dns_t ... - pads description
ptype web_t ... - pads description
ptype probe_t ... - pads description
ptype host_d(h:phostname, t:pdate) pdirectory ... as before ...
ptype host_d () pdirectory
  hosts is [t : date_d(t) | t <- pdate]
ptype coral_d () pdirectory
  hostNames is Config conf_t
  hosts is [h : host_d | h <- hostNames]
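The structure this description expresses can be approximated in plain Python: the config file determines the host directories, each host directory holds one subdirectory per date, and each date directory should hold the four log files. In the sketch below the log file names and the `Config` file name come from the slides; the config format (one host name per line), the use of subdirectory names as dates, and the function names are assumptions.

```python
from pathlib import Path

LOG_FILES = ["corald.log", "nssrv.log", "websrv.log", "probed.log"]

def load_coral(root):
    """Return {host: {date: {log name: present?}}} for a CORAL-like tree."""
    root = Path(root)
    # Assumed config format: one host name per non-empty line.
    hosts = [line.strip()
             for line in (root / "Config").read_text().splitlines()
             if line.strip()]
    tree = {}
    for host in hosts:
        host_dir = root / host
        dates = []
        if host_dir.is_dir():
            dates = sorted(d.name for d in host_dir.iterdir() if d.is_dir())
        tree[host] = {
            date: {log: (host_dir / date / log).is_file() for log in LOG_FILES}
            for date in dates
        }
    return tree

def missing_logs(tree):
    """List (host, date, log name) triples for every absent log file."""
    return [(host, date, log)
            for host, dates in tree.items()
            for date, logs in dates.items()
            for log, present in logs.items() if not present]
```

With the tree loaded this way, a question such as "on what days are files missing?" becomes an ordinary query over the returned dictionary.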
34Current and Future Plans
- Designing a semantics based on a classical logic of trees
- We considered using one of the substructural (separating) tree logics but discarded it: the substructural logics gave us the wrong defaults and made the system harder to design and understand (especially in the presence of parent pointers)
- Building a file system parser and tool generation infrastructure in Haskell
- Leverage type-directed programming.
- Leverage laziness in loading structures.
- Envision a collection of file system management tools based on descriptions
- valid desc d -- check for conformance to d
- ls desc d -- list files described by d
- grep pattern desc d -- grep for pattern in files described by d
- mv desc d foo bar -- move files described by d rooted at foo to bar
- Thinking about a query engine and a continuous monitoring system
- Considering extensions to handle other elements of the programming environment
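As a rough illustration of description-driven tools, the Python sketch below implements stand-ins for `ls` and `grep` over a description. Here a description is simply a list of glob patterns relative to a root directory, which is a strong simplification chosen for the example; the function names and signatures are hypothetical and say nothing about the planned tools' actual interfaces.

```python
import re
from pathlib import Path

def ls(root, description):
    """List the files under `root` matched by any pattern in `description`."""
    root = Path(root)
    found = set()
    for pattern in description:
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)

def grep(pattern, root, description):
    """Yield (path, line number, line) for matching lines in described files."""
    regex = re.compile(pattern)
    for path in ls(root, description):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                yield path, number, line
```

Restricting both tools to the files a description names is what separates them from their ordinary counterparts: unrelated files under the same root are never listed or searched.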
35The End