Title: Alias-i Linguistic Pipeline: Architecture, Algorithms & Applications
1 Alias-i Linguistic Pipeline: Architecture, Algorithms & Applications
- Bob Carpenter
- Alias-i, Inc.
- carp_at_aliasi.com
2 Who is Alias-i?
- Spun out of the 1995 U Penn Message Understanding Conference (MUC-6) projects on coreference
- Founded in 2000 by Breck Baldwin as Baldwin Language Technologies
- I'm the other technical employee as of 2003.
- Funded through the Defense Advanced Research Projects Agency (DARPA) through the Translingual Information Detection, Extraction and Summarization Program (TIDES) and the Total, er, Terrorist Information Awareness Program (TIA)
- Targeting Research Analysts with Text Mining
- Based in Brooklyn (we love visitors)
3 Application: ThreatTracker Interface
- Intended for use by Information Analysts
- Analysts typically get short-term assignments and are asked to do thorough reviews, producing reports at the end.
- Some analysts are assigned to track situations longer term.
- Use unstructured news feeds and standing collections as sources
- Basically, a lot like legal, medical or biological research
- Trackers Specify Structured Searches & Gatherers
- Entities, Sub-trackers, Sample Documents, Saved Keyword Searches, Alerts
- Allow addition of annotated documents making up a case
- Entities Specify
- Aliases
- Spelling, Language, Coreference Properties
- Properties
- Person (Gender), Place, Thing, Other
- Trackers evaluated against real-time feeds and/or standing collections
4 Tracker Example(s)
- Tracker: New York Yankees
- Entity: New York Yankees
- Aliases: Bronx Bombers, ...
- Properties: Organization
- Tracker: Yankee Players
- Entity: Joe DiMaggio
- Aliases: Joltin' Joe, The Yankee Clipper, Joe D
- Properties: Person/male
- Entity: Babe Ruth
- ...
- Entity: Yankee Stadium
- Aliases: The stadium, The house that Ruth built, ...
- Properties: Facility
- Document: (The Onion) Steinbrenner corners free-agent market.
- Tracker: Sports
- Tracker: Baseball
- Tracker: Teams
- Tracker: NY Yankees
5 ThreatTracker Interface Screenshot
- [Screenshot] Marker indicates sentences have been removed because they don't mention the Entity
- Translation of Excerpt Summary
- Mentions of Vajpayee and Pakistan found by ThreatTracker
6 ThreatTracker Architecture
7 Client and Web-Container Architecture: Flexible Model-View-Controller (MVC)
8 ThreatTracker Document Analysis: 20k words/sec, 250k docs / 1.5GB
9 LingPipe Document Analysis
- LingPipe implements (most of) Document Analysis
- XML, HTML and Plain Text input; (well-formed) XML output
- Tokenization
- Named-entity Extraction
- Sentence Boundary Detection
- Within-document Coreference
- Not yet released: cross-document coreference
- Dual Licensing
- Open Source
- Commercial
- 100% Pure Java (runs anywhere that runs Java)
- Quick start-up with sample scripts & Ant tasks
- Extensive JavaDoc
- API & command-line resources
- Production-quality code & unit testing
10 XML Handling: SAX Filters
- All input/output is handled through SAX filters
- Streams all I/O at the element level
- An org.xml.sax.ContentHandler receives callbacks
- startElement(Element, Attributes), endElement(Element)
- startDocument(), endDocument()
- characters(char[] cs, int start, int length)
- And a whole lot more
- Not event-based, despite what everyone calls it
- SAX filters
- Same pattern as the Java stream filters (e.g. java.io.FilterInputStream)
- Allow chains of handlers to be combined
- Full XML Processing
- Entities, DTD validation, character sets, etc.
- Supplied filters tunable to input elements, or can be run on all text content
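The filter-chaining pattern above can be sketched with the standard org.xml.sax API. This is a minimal illustration, not LingPipe code; the character-counting filter is invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// A minimal SAX filter: it observes characters() callbacks (here, counting
// text characters) while passing every event through to downstream handlers,
// so filters can be chained like java.io stream filters.
public class CharCountFilter extends XMLFilterImpl {
    public int textChars = 0;

    @Override
    public void characters(char[] cs, int start, int length)
            throws org.xml.sax.SAXException {
        textChars += length;                  // inspect the event...
        super.characters(cs, start, length);  // ...then pass it along the chain
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        CharCountFilter filter = new CharCountFilter();
        filter.setParent(reader);  // chain: reader -> filter -> (any handler)
        filter.parse(new InputSource(
                new StringReader("<doc><s>John ran.</s></doc>")));
        System.out.println(filter.textChars);  // 9 chars in "John ran."
    }
}
```

Because everything streams at the element level, memory stays flat no matter how large the document is.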
11 HTML & Plain Text Handling
- HTML run through CyberNeko's HTML parser
- Implements org.xml.sax.XMLReader over HTML input
- HTML's a mess, so you're taking chances
- Plain Text Input
- Handled with SAX filter, with wrapper elements
- Text just sent to characters()
12 Tokenization
- General Interface: streams output
- Tokenizer(char[], int, int)
- String nextToken()
- String nextWhitespace()
- Whitespace is critical for reconstructing the original document with tags in place
- Implementation for Indo-European
- Very fine-grained tokenization
- But try to keep numbers, alphanumerics, and compound symbols together
- 555-1212, 100,000, ---, 40R
- Not cheating as in many pre-tokenized evaluations
- Break on most punctuation
- "Mr. Smith-Jones." yields 6 tokens
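A toy version of the token/whitespace streaming contract might look like this (names and tokenization rules are illustrative, not LingPipe's implementation). The key invariant is that alternating whitespace and tokens reconstructs the original text exactly:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {
    private final char[] cs;
    private int pos;
    private final int end;

    public SimpleTokenizer(char[] cs, int start, int length) {
        this.cs = cs; this.pos = start; this.end = start + length;
    }

    // Whitespace run before the next token (may be empty).
    public String nextWhitespace() {
        int s = pos;
        while (pos < end && Character.isWhitespace(cs[pos])) pos++;
        return new String(cs, s, pos - s);
    }

    // Next maximal run of letters/digits, or a single punctuation char
    // (crude stand-in for "break on most punctuation").
    public String nextToken() {
        if (pos >= end) return null;
        int s = pos;
        if (Character.isLetterOrDigit(cs[pos])) {
            while (pos < end && Character.isLetterOrDigit(cs[pos])) pos++;
        } else {
            pos++;  // one symbol per token
        }
        return new String(cs, s, pos - s);
    }

    public static void main(String[] args) {
        char[] text = "Mr. Smith-Jones.".toCharArray();
        SimpleTokenizer t = new SimpleTokenizer(text, 0, text.length);
        List<String> tokens = new ArrayList<>();
        StringBuilder rebuilt = new StringBuilder();
        for (;;) {
            rebuilt.append(t.nextWhitespace());
            String tok = t.nextToken();
            if (tok == null) break;
            tokens.add(tok);
            rebuilt.append(tok);
        }
        System.out.println(tokens);  // [Mr, ., Smith, -, Jones, .] -- six tokens
        System.out.println(rebuilt.toString().equals("Mr. Smith-Jones."));  // true
    }
}
```

Streaming the whitespace alongside the tokens is what lets the pipeline re-emit the input document with entity tags spliced in place.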
13 Interfaces & Abstract Factories
- Interfaces allow flexible implementations of tokenizers
- Factories allow reflectively specified tokenizer creation
- TokenizerFactory interface (not an abstract class)
- Tokenizer createTokenizer(char[] cs, int start, int length)
- All APIs accept tokenizer factories for flexibility
- Reflection allows command-line specification
- -tokenizerFactory=fee.fi.fo.fum.TokenizerFactory
- Java's Reflection API used to create the tokenizer factory
- Assumes nullary constructor for factory
- Named-entity extraction and string-matching also handled with factories for flexible implementations
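The reflective-factory idea can be sketched as follows, assuming only a public nullary constructor. The interface and class names are invented stand-ins, not LingPipe's API:

```java
import java.util.Arrays;

public class FactoryDemo {
    // Minimal stand-in for a tokenizer factory interface (illustrative only).
    public interface TokenizerFactory {
        String[] tokenize(char[] cs, int start, int length);
    }

    // A concrete factory with the required nullary constructor.
    public static class WhitespaceTokenizerFactory implements TokenizerFactory {
        public WhitespaceTokenizerFactory() { }
        public String[] tokenize(char[] cs, int start, int length) {
            return new String(cs, start, length).trim().split("\\s+");
        }
    }

    public static void main(String[] args) throws Exception {
        // As if read from a flag like -tokenizerFactory=some.pkg.FactoryClass
        String name = FactoryDemo.class.getName() + "$WhitespaceTokenizerFactory";
        // Reflection: load the class by name and call its nullary constructor.
        TokenizerFactory factory = (TokenizerFactory)
                Class.forName(name).getDeclaredConstructor().newInstance();
        char[] text = "John Smith is here".toCharArray();
        System.out.println(Arrays.toString(factory.tokenize(text, 0, text.length)));
        // -> [John, Smith, is, here]
    }
}
```

Because every API takes a factory rather than a tokenizer, a user can swap tokenization behavior from the command line without recompiling.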
14 Named Entity Detection
- Balancing Speed With Efficiency
- 100K tokens/second runtime
- Windows XP
- 3GHz P4, 800MHz FSB, 2 10K ATA disks in RAID-0
- Sun's JDK 1.4.2 on Windows XP
- -server mode
- .93 MUC7 F-score (more on scores later)
- Very low dynamic memory requirements due to streamed output
- Train 500K tokens, decode & score 50K tokens in 20-30 seconds
- Pipelined Extraction of Entities
- Speculative
- User-defined
- Pronouns
- Stop-list Filtering (not in LingPipe, but in ThreatTracker)
- User-defined Mentions, Pronouns & Stop list
- Specified in a dictionary
- Left-to-right, longest match
- Removes overlapping speculative mentions
- Stop list just removes complete matches
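A naive sketch of the left-to-right, longest-match dictionary tagging described above (the production system uses a trie / Aho-Corasick, per the algorithms slide; the dictionary entries and tag scheme here are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class DictTagger {
    // Tag tokens left-to-right, always trying the longest dictionary span
    // first; first token of a match gets ST_TYPE, the rest get TYPE.
    static List<String> tag(List<String> tokens, Map<String, String> dict, int maxLen) {
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            String type = null;
            for (int len = Math.min(maxLen, tokens.size() - i); len > 0; len--) {
                String span = String.join(" ", tokens.subList(i, i + len));
                if (dict.containsKey(span)) { matched = len; type = dict.get(span); break; }
            }
            if (matched == 0) {
                tags.add("OUT");
                i++;
            } else {
                tags.add("ST_" + type);
                for (int k = 1; k < matched; k++) tags.add(type);
                i += matched;  // longest match consumes the whole span
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> dict = Map.of(
            "Joe Dimaggio", "PERSON", "Joe", "PERSON", "Yankee Stadium", "FACILITY");
        System.out.println(tag(Arrays.asList("Joe", "Dimaggio", "at", "Yankee", "Stadium"),
                               dict, 3));
        // -> [ST_PERSON, PERSON, OUT, ST_FACILITY, FACILITY]
    }
}
```

Note how the longest match wins: "Joe Dimaggio" is tagged as a single PERSON even though "Joe" alone is also in the dictionary.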
15 Speculative Named Entity Tagging
- Chunking as Tagging
- Convert a parsing problem to a tagging problem
- Assign ST_TAG, TAG and OUT to tokens
- INPUT: John Smith is in Washington.
- OUTPUT: John/ST_PERSON Smith/PERSON is/OUT in/OUT Washington/ST_LOCATION ./OUT
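The chunking-to-tagging conversion can be sketched directly. The chunk spans and tag names follow the slide; the helper function itself is a hypothetical illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkTagger {
    // Convert tokens plus (start, end, type) chunks into per-token tags:
    // chunk-initial token gets ST_TYPE, chunk-internal tokens get TYPE,
    // everything else gets OUT.
    static List<String> toTags(List<String> tokens, int[][] chunks, String[] types) {
        List<String> tags = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) tags.add("OUT");
        for (int c = 0; c < chunks.length; c++) {
            int start = chunks[c][0], end = chunks[c][1];
            tags.set(start, "ST_" + types[c]);   // chunk-initial token
            for (int i = start + 1; i < end; i++)
                tags.set(i, types[c]);           // chunk-internal tokens
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("John", "Smith", "is", "in", "Washington", ".");
        int[][] chunks = { {0, 2}, {4, 5} };     // [John Smith], [Washington]
        String[] types = { "PERSON", "LOCATION" };
        System.out.println(toTags(tokens, chunks, types));
        // -> [ST_PERSON, PERSON, OUT, OUT, ST_LOCATION, OUT]
    }
}
```

With chunks encoded this way, a plain sequence tagger recovers the chunk boundaries for free: every ST_TAG opens a chunk, and the chunk ends at the next tag that differs.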
16 Statistical Named Entity Model
- Generative Statistical Model
- Find most likely tags given words
- ARGMAX_Ts P(Ts|Ws) = ARGMAX_Ts P(Ts,Ws)/P(Ws) = ARGMAX_Ts P(Ts,Ws)
- Predict next word/tag pair based on previous word/tag pairs
- word trigram, tag bigram history
- Decompose into tag and lexical model
- P(wn,tn | tn-1, wn-1, wn-2)
- = P(tn | tn-1, wn-1, wn-2)  [tag model]
- x P(wn | tn, tn-1, wn-1)  [lexical model]
- State Tying for Lexical Model
- In P(wn | tn, tn-1, ...), tn-1 doesn't differentiate TAG and ST_TAG
- P(wn | tn, tn-1, wn-1, wn-2) = P(wn | tn, wn-1) if tn = tn-1
- Bigram model within category
- P(wn | tn, tn-1, wn-1, wn-2) = P(wn | tn, tn-1) if tn != tn-1
- Unigram model cross category
17 Smoothing the Named Entity Model
- Witten-Bell smoothing
- Not as accurate as held-out estimation, but much simpler
- P(E|C1,C2) = lambda(C1,C2) P_ml(E|C1,C2) + (1 - lambda(C1,C2)) P(E|C1)
- lambda(x) = events(x) / (events(x) + K * outcomes(x))
- Lexical Model: smooth to uniform vocab estimate
- Tag Model: tag given tag, for well-formedness
- Category-based Smoothing of Unknown Tokens
- Assign lexical tokens to categories
- Capitalized, all-caps, alpha-numeric, number+period, etc.
- Replace unknown words with categories
- Result is not a joint model of P(Ws,Ts)
- OK for maximizing P(Ts|Ws)
- No category-based smoothing of known tokens in history
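The Witten-Bell interpolation works out numerically like this; the counts and K below are made-up illustration values, not model parameters:

```java
public class WittenBell {
    // lambda(x) = events(x) / (events(x) + K * outcomes(x))
    // High-count contexts with few distinct outcomes trust the ML estimate;
    // sparse contexts back off toward the lower-order model.
    static double lambda(int events, int outcomes, double k) {
        return events / (events + k * outcomes);
    }

    // P(e | C1,C2) = lambda * P_ml(e | C1,C2) + (1 - lambda) * P(e | C1)
    static double smoothed(double pMl, double pBackoff,
                           int events, int outcomes, double k) {
        double l = lambda(events, outcomes, k);
        return l * pMl + (1 - l) * pBackoff;
    }

    public static void main(String[] args) {
        // Context seen 8 times with 2 distinct outcomes, K = 4:
        // lambda = 8 / (8 + 4*2) = 0.5, so the estimate is
        // 0.5 * 0.75 + 0.5 * 0.10 = 0.425
        System.out.println(smoothed(0.75, 0.10, 8, 2, 4.0));
    }
}
```

The same interpolation recurses down the context orders until it bottoms out at the uniform vocabulary estimate for the lexical model.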
18 Blending Dictionaries/Gazetteers
- Lexical and Tag models
- Given John/PERSON
- P(John|ST_PERSON)
- Given John Smith/PERSON
- P(Smith|PERSON,ST_PERSON,John)
- P(PERSON|ST_PERSON,John)
- Given John Smith Junior/PERSON
- P(Junior|PERSON,PERSON,Smith,John)
- P(PERSON|PERSON,Smith,John)
- Easier with a pure language-model based system
19 Multi-lingual & Multi-genre Models
- Based on language segmentation for SpeechWorks
- Trained models for Hindi & English
- TIDES Surprise Language 2003
- Ported our ThreatTracker interface
- About ½-1 F-score hit for using the multilingual model
- Models don't interfere much
- P(wn | tn, tn-1, wn-1)
- Until smoothing to P(wn | tn), only uses Hindi context for Hindi following, if tn, wn-1 is known.
- P(tn | tn-1, wn-1, wn-2)
- Until smoothing to P(tn | tn-1)
- Would probably help to model transitions on multi-lingual data and the expected quantity of each if not uniform
- As is, we just trained with all the data we had (400K toks/language)
- Not nearly as bad as HMMs for pronunciation variation
20 Named Entity Algorithms
- See Dan Gusfield's book Algorithms on Strings, Trees, and Sequences
- Must-read for non-statistical string algorithms
- Also a great intro to suffix trees and computational biology
- Theoretically linear in input text size x tag set size
- Beam greatly reduces dependence on tagging
- Smoothing ST_TAG and TAG together reduces contexts by half
- Dictionary-based tagging
- Aho-Corasick Algorithm is linear asymptotically
- Trie with suffix-to-prefix matching
- Actually more efficient to just hash prefixes for short strings
- Statistical Model Decoding
- Simple dynamic programming (often called Viterbi)
- Only keep best analysis for outcome given history
- Outcomes are tags, and only bigram tag history
- Lattice slicing for constant memory allocation (vs. full lattice)
- Allocate a pair of arrays sized by tags and re-use per token
- Still need backpointers, but in practice very deterministic
- Rely on Java's Garbage Collection
21 So why's it so slow?
- Limiting factor is memory-to-CPU bandwidth
- aka frontside bus (FSB)
- Determined by chipset, motherboard & memory
- Best Pentium FSB: 800MHz (vs 3.2GHz CPU)
- Best Xeon FSB: 533MHz
- Models are 2-15 MB, even pruned & packed
- CPU L2 cache sizes are 512K to 1MB
- Thus, most model lookups are cache misses
- Same issue as database paging, only closer to the CPU
22 Packing Models into Memory
- Based on SpeechWorks Language ID work
- Had to run on a handheld with multiple models
- Prune Low Counts
- Better to do Relative-Entropy-Based Pruning: eliminate estimate counts that are similar to smoothed estimates
- Symbol tables for tokens; 32-bit floating point estimates
- At SPWX, mapped floats to 16-bit integers
- Trie structure from general to specific contexts
- Only walk down until context is found (lambda != 0.0)
- P(wn | tn, tn-1, wn-1)
- Contexts: tn -> tn-1 -> wn-1, storing log(1 - lambda(context))
- Outcomes: wn, storing log P(wn | context)
- Array-based with binary search
- Binary search is very hard on memory with large arrays
- Better to hash low-order contexts; OK for smaller contexts
- I'm going to need the board for this one
23 Named Entity Models and Accuracy
- Spanish News (CoNLL): P .95, R .96, F .95
- English News (MUC7): P .95, R .92, F .93
- Hindi News (TIDES SL): P .89, R .84, F .86
- English Genomics (GENIA): P .79, R .79, F .79
- Dutch News (CoNLL): P .90, R .68, F .77
- All tested without Gazetteers
- All-caps models only 5-10% less accurate
24 Within-Document Coreference
- Mentions merged into mention chains
- Greedy left-to-right algorithm over mentions
- Computes match of mention vs. all previous mention chains
- No match creates a new mention chain
- Ties cause a new mention chain (or can cause tighter match)
- Matching functions determined by entity type (PERSON, ORGANIZATION, etc.)
- Generic matching functions for token-sensitive edit distance
- Next step is soundex-style spelling variation
- Specialized matching for pronouns and gender
- Matching functions may depend on user-defined entities, providing thesaurus-like expansion (Joe DiMaggio and Joltin' Joe or the Yankee Clipper)
- User-configurable matching based on entity type (e.g. PROTEIN)
- Next step is to add contextual information
25 Cross-Document Coreference
- Mention chains merged into entities
- Greedy order-independent algorithm over mention chains
- Matching functions involve complex reasoning over sets of mentions in a chain versus sets of mentions in candidate entities.
- Matching involves properties of the mentions in the whole database and degree of overlap
- "Joe" or "Bush" show up in too many entities to be good distinguishing matchers
- Chain: John Smith, Mr. Smith, Smith
- Entity 1: John Smith Jr., John Smith, John, Smith
- Entity 2: John Smith Sr., John Smith, Jack Smith, Senior
- Chain: John James Smith, John Smith
- Entity: John Smith, Smith, John K. Smith
- Only pipeline component that must run synchronously.
- Only takes 5% of pipeline processing time.
- Next step (recreating Bagga/Baldwin): contextual information
26 Document Feed Web Service for DARPA
- HTTP implementation of Publish/Subscribe.
- Implemented as Servlets.
- Subscribers submit a URL to receive documents.
- Publishers submit binary documents.
- May be validated if the form is known, e.g. via XML DTD.
- Subscribers receive all published documents via HTTP.
- A more general implementation allows reception by topic.
27 What's next?
- Goal is total recall, with highest possible precision
- Finding spelling variations of names
- Suffix Trees
- Edit Distance (weighted by spelling variation)
- Cross-linguistically (pronunciation transduction)
- Context (weighted keyword in context)
- Over 100K newswire articles
- Name structure
- Nicknames: Robert -> Bob
- Acronyms: International Business Machines -> IBM
- Abbreviations: Bob Co. -> Bob Corporation
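Plain (unweighted) Levenshtein edit distance is the natural starting point for the weighted variant mentioned above; a weighted version would simply replace the unit costs with costs tuned to likely spelling variations. A minimal sketch, using the same two-row memory trick as the Viterbi decoder:

```java
public class EditDistance {
    // Standard Levenshtein distance with unit insert/delete/substitute costs,
    // computed with two reusable rows instead of a full matrix.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Dimaggio", "DiMaggio"));  // 1 (case substitution)
        System.out.println(distance("Smith", "Smyth"));        // 1
    }
}
```

A spelling-variation-weighted version would, for instance, charge less for vowel substitutions and case changes than for consonant swaps, so "Smith"/"Smyth" scores as a closer match than "Smith"/"Slith".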
28 Analyzed Document Format
- <!ELEMENT DOCUMENT (P+)>
- <!ATTLIST DOCUMENT
-   uri CDATA #REQUIRED
-   source CDATA #REQUIRED
-   language CDATA #REQUIRED
-   title CDATA #REQUIRED
-   classification CDATA "UNCLASSIFIED"
-   date CDATA #REQUIRED>
- <!ELEMENT P (S+)>
- <!-- Analysis adds rest of data to input document -->
- <!ELEMENT S (#PCDATA | ENAMEX)*>
- <!ELEMENT ENAMEX (#PCDATA)>
- <!ATTLIST ENAMEX
-   id CDATA #REQUIRED
-   type CDATA #REQUIRED>