Alias-i Linguistic Pipeline: Architecture, Algorithms & Applications

1
Alias-i Linguistic Pipeline: Architecture, Algorithms & Applications

Bob Carpenter
Alias-i, Inc.
carp_at_aliasi.com

2
Who is Alias-i?

Spun out of 1995 U Penn Message Understanding Conference (MUC-6) projects on coreference
Founded in 2000 by Breck Baldwin as Baldwin Language Technologies
I'm the other technical employee as of 2003.
Funded by the Defense Advanced Research Projects Agency (DARPA) through the Translingual Information Detection, Extraction and Summarization Program (TIDES) and the Total, er, Terrorist Information Awareness Program (TIA)
Targeting research analysts with text mining
Based in Brooklyn (we love visitors)

3
Application: ThreatTracker Interface

Intended for use by information analysts
Analysts typically get short-term assignments and are asked to do thorough reviews, producing reports at the end.
Some analysts are assigned to track situations longer term.
Use unstructured news feeds and standing collections as sources
Basically, a lot like legal, medical or biological research
Trackers specify structured searchers & gatherers
Entities, sub-trackers, sample documents, saved keyword searches, alerts
Allow addition of annotated documents making up a case
Entities specify
Aliases: spelling, language, coreference properties
Properties: Person (Gender), Place, Thing, Other
Trackers are evaluated against real-time feeds and/or standing collections

4
Tracker Example(s)

Tracker: New York Yankees
Entity: New York Yankees
Aliases: Bronx Bombers, ...
Properties: Organization
Tracker: Yankee Players
Entity: Joe DiMaggio
Aliases: Joltin' Joe, The Yankee Clipper, Joe D
Properties: Person/male
Entity: Babe Ruth
Entity: Yankee Stadium
Aliases: The stadium, The house that Ruth built, ...
Properties: Facility
Document: (The Onion) Steinbrenner corners free-agent market.
Tracker: Sports
Tracker: Baseball
Tracker: Teams
Tracker: NY Yankees

5
ThreatTracker Interface Screenshot

indicates sentences have been removed because they don't mention the Entity
Translation of Excerpt Summary
Mentions of Vajpayee and Pakistan found by ThreatTracker

6
ThreatTracker Architecture

7
Client and Web-Container Architecture: Flexible Model-View-Controller (MVC)

8
ThreatTracker Document Analysis: 20k words/sec; 250k docs / 1.5GB

9
LingPipe Document Analysis

LingPipe implements (most of) Document Analysis
XML, HTML and plain-text input; (well-formed) XML output
Tokenization
Named-entity extraction
Sentence boundary detection
Within-document coreference
Not yet released: cross-document coreference
Dual licensing
Open source
Commercial
100% pure Java (runs anywhere that runs Java)
Quick start-up with sample scripts & Ant tasks
Extensive JavaDoc
API & command-line resources
Production-quality code & unit testing

10
XML Handling: SAX Filters

All input/output is handled through SAX filters
Streams all I/O at the element level
An org.xml.sax.ContentHandler receives callbacks
startElement(Element, Attributes)
endElement(Element)
startDocument(), endDocument()
characters(char[] cs, int start, int length)
And a whole lot more
Not event-based, despite what everyone calls it
SAX filters
Same pattern as the Java stream filters (e.g. java.io.FilterInputStream)
Allow chains of handlers to be combined
Full XML processing
Entities, DTD validation, character sets, etc.
Supplied filters are tunable to input elements, or can be run on all text content
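
The filter chaining described on this slide can be sketched with the standard org.xml.sax.helpers.XMLFilterImpl class. This is only an illustration, not LingPipe's own filters: ElementTextFilter, UpperCaseFilter and SaxChainDemo are hypothetical names, and the uppercasing step merely stands in for a real annotation step. The first filter shows the "tunable to input elements" idea by passing text through only inside a chosen element.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: forwards text only while inside the chosen element. */
final class ElementTextFilter extends XMLFilterImpl {
    private final String element;
    private int depth = 0;

    ElementTextFilter(XMLReader parent, String element) {
        super(parent);
        this.element = element;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (qName.equals(element)) ++depth;
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        super.endElement(uri, local, qName);
        if (qName.equals(element)) --depth;
    }

    @Override
    public void characters(char[] cs, int start, int length) throws SAXException {
        if (depth > 0) super.characters(cs, start, length);
    }
}

/** Hypothetical filter: stands in for an annotator by uppercasing text content. */
final class UpperCaseFilter extends XMLFilterImpl {
    UpperCaseFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] cs, int start, int length) throws SAXException {
        char[] up = new String(cs, start, length).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(up, 0, up.length);
    }
}

final class SaxChainDemo {
    /** Parses xml through the two chained filters and returns the text that survives. */
    static String run(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLReader chain = new UpperCaseFilter(new ElementTextFilter(reader, "p"));
            StringBuilder out = new StringBuilder();
            chain.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] cs, int start, int length) {
                    out.append(cs, start, length);
                }
            });
            chain.parse(new InputSource(new StringReader(xml)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("SAX chain failed", e);
        }
    }
}
```

Each filter sees the callbacks of the one it wraps, so new stages can be added by wrapping the chain once more without touching existing filters.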

11
HTML & Plain Text Handling

HTML is run through CyberNeko's HTML parser
Implements org.xml.sax.XMLReader over HTML input
HTML's a mess, so you're taking chances
Plain text input
Handled with a SAX filter, with wrapper elements
Text is just sent to characters()

12
Tokenization

General interface streams output
Tokenizer(char[], int, int)
String nextToken()
String nextWhitespace()
Whitespace is critical for reconstructing the original document with tags in place
Implementation for Indo-European
Very fine-grained tokenization
But try to keep numbers, alphanumerics, and compound symbols together
555-1212   100,000   ---   40R
Not cheating as in many pre-tokenized evaluations
Break on most punctuation
"Mr. Smith-Jones." yields 6 tokens
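
A minimal sketch of a streaming tokenizer with the nextToken()/nextWhitespace() shape listed above. The class name SketchTokenizer and the regular expression are assumptions made for illustration; the slide does not give LingPipe's actual rules, only the examples this pattern is written to reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SketchTokenizer {
    // Assumed rules, chosen only to reproduce the slide's examples.
    private static final Pattern TOKEN = Pattern.compile(
        "\\d[\\d\\p{L}]*(?:[-,.]\\d[\\d\\p{L}]*)*" // numbers, alphanumerics: 555-1212 100,000 40R
        + "|\\p{L}[\\p{L}\\d]*"                    // plain words
        + "|([^\\s\\p{L}\\d])\\1*");               // run of one repeated symbol: ---

    private final String text;
    private final Matcher matcher;
    private int pos = 0;

    SketchTokenizer(char[] cs, int start, int length) {
        text = new String(cs, start, length);
        matcher = TOKEN.matcher(text);
    }

    /** Returns the (possibly empty) whitespace before the next token. */
    String nextWhitespace() {
        int begin = pos;
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) ++pos;
        return text.substring(begin, pos);
    }

    /** Returns the next token, or null when the input is exhausted. */
    String nextToken() {
        nextWhitespace(); // skip any whitespace the caller did not consume
        if (pos >= text.length()) return null;
        matcher.region(pos, text.length());
        int end = matcher.lookingAt() ? matcher.end() : pos + 1; // fall back to one char
        String token = text.substring(pos, end);
        pos = end;
        return token;
    }

    /** Convenience helper: collects all tokens of a string. */
    static List<String> tokens(String s) {
        SketchTokenizer t = new SketchTokenizer(s.toCharArray(), 0, s.length());
        List<String> out = new ArrayList<>();
        for (String tok; (tok = t.nextToken()) != null; ) out.add(tok);
        return out;
    }
}
```

Calling nextWhitespace() before each nextToken() and concatenating the results rebuilds the input exactly, which is what makes it possible to re-insert tags into the original document.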

13
Interfaces & Abstract Factories

Interfaces allow flexible implementations of tokenizers
Factories allow reflectively specified tokenizer creation
TokenizerFactory interface (not an abstract class)
Tokenizer createTokenizer(char[] cs, int start, int length)
All APIs accept tokenizer factories for flexibility
Reflection allows command-line specification
-tokenizerFactory=fee.fi.fo.fum.TokenizerFactory
Java's Reflection API is used to create the tokenizer factory
Assumes a nullary constructor for the factory
Named-entity extraction and string-matching are also handled with factories for flexible implementations
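
The reflective factory creation described above might look roughly like this. The Tokenizer and TokenizerFactory interfaces here are simplified stand-ins for the ones named on the slide, and WhitespaceTokenizerFactory and FactoryLoader are hypothetical names; the "-tokenizerFactory=" argument form is assumed from the slide's example.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Simplified stand-in for the tokenizer interface named on the slide. */
interface Tokenizer {
    String nextToken(); // null when exhausted
}

/** Simplified stand-in for the factory interface named on the slide. */
interface TokenizerFactory {
    Tokenizer createTokenizer(char[] cs, int start, int length);
}

/** Hypothetical implementation; the nullary constructor is what reflection relies on. */
final class WhitespaceTokenizerFactory implements TokenizerFactory {
    public WhitespaceTokenizerFactory() {}

    @Override
    public Tokenizer createTokenizer(char[] cs, int start, int length) {
        String trimmed = new String(cs, start, length).trim();
        String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Iterator<String> it = Arrays.asList(parts).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }
}

final class FactoryLoader {
    /** Instantiates a factory from its class name via its nullary constructor. */
    static TokenizerFactory load(String className) {
        try {
            return (TokenizerFactory) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create tokenizer factory: " + className, e);
        }
    }

    /** Looks for -tokenizerFactory=<class>; falls back to the whitespace factory. */
    static TokenizerFactory fromArgs(String[] args) {
        String prefix = "-tokenizerFactory=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) return load(arg.substring(prefix.length()));
        }
        return new WhitespaceTokenizerFactory();
    }
}
```

Because callers depend only on the interface, a new tokenizer can be swapped in from the command line without recompiling the code that uses it.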

14
Named Entity Detection

Balancing Speed With Efficiency
100K tokens/second runtime
Windows XP
3GHz P4, 800MHz FSB, 2×10K ATA disks in RAID-0
Sun's JDK 1.4.2 on Windows XP
-server mode
.93 MUC7 F-score (more on scores later)
Very low dynamic memory requirements due to streamed output
Train on 500K tokens, decode & score 50K tokens in 20-30 seconds
Pipelined extraction of entities
Speculative
User-defined
Pronouns
Stop-list filtering (not in LingPipe, but in ThreatTracker)
User-defined mentions, pronouns & stop list
Specified in a dictionary
Left-to-right, longest match
Removes overlapping speculative mentions
Stop list just removes complete matches
15
Speculative Named Entity Tagging

Chunking as tagging
Convert a parsing problem to a tagging problem
Assign ST_TAG, TAG and OUT to tokens
INPUT: John Smith is in Washington.
OUTPUT: John:ST_PERSON Smith:PERSON is:OUT in:OUT Washington:ST_LOCATION .:OUT
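
Going back from the tag sequence to chunks is mechanical, as this sketch shows. ChunkDecoder and the "text:TYPE" output format are illustrative choices; the handling of an ill-formed TAG with no preceding ST_TAG (treated as outside any chunk) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkDecoder {
    /** Turns parallel token/tag arrays into chunks formatted as "text:TYPE". */
    static List<String> decode(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder text = null; // text of the chunk being built, if any
        String type = null;        // its entity type
        for (int i = 0; i < tokens.length; ++i) {
            String tag = tags[i];
            if (tag.startsWith("ST_")) {
                // ST_TAG always opens a new chunk, closing any open one.
                if (text != null) chunks.add(text + ":" + type);
                type = tag.substring(3);
                text = new StringBuilder(tokens[i]);
            } else if (text != null && tag.equals(type)) {
                // TAG continues the open chunk of the same type.
                text.append(' ').append(tokens[i]);
            } else {
                // OUT, or a continuation tag with nothing to continue.
                if (text != null) chunks.add(text + ":" + type);
                text = null;
                type = null;
            }
        }
        if (text != null) chunks.add(text + ":" + type);
        return chunks;
    }
}
```

The separate ST_ tag is what lets two adjacent entities of the same type be told apart.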

16
Statistical Named Entity Model

Generative statistical model
Find most likely tags given words
$\arg\max_{Ts} P(Ts \mid Ws) = \arg\max_{Ts} P(Ts, Ws) / P(Ws) = \arg\max_{Ts} P(Ts, Ws)$
Predict next word/tag pair based on previous word/tag pairs
word trigram, tag bigram history
Decompose into tag and lexical model
$P(w_n, t_n \mid t_{n-1}, w_{n-1}, w_{n-2}) = P(t_n \mid t_{n-1}, w_{n-1}, w_{n-2}) \cdot P(w_n \mid t_n, t_{n-1}, w_{n-1})$
Tag model: $P(t_n \mid t_{n-1}, w_{n-1}, w_{n-2})$
Lexical model: $P(w_n \mid t_n, t_{n-1}, w_{n-1})$
State tying for lexical model
$P(w_n \mid t_n, t_{n-1}, \ldots)$: $t_{n-1}$ doesn't differentiate TAG and ST_TAG
$P(w_n \mid t_n, t_{n-1}, w_{n-1}, w_{n-2}) = P(w_n \mid t_n, w_{n-1})$ if $t_n \neq t_{n-1}$
Bigram model within category
$P(w_n \mid t_n, t_{n-1}, w_{n-1}, w_{n-2}) = P(w_n \mid t_n, t_{n-1})$ if $t_n = t_{n-1}$
Unigram model cross category
17
Smoothing the Named Entity Model

Witten-Bell smoothing
Not as accurate as held-out estimation, but much simpler
$P(E \mid C_1, C_2) = \lambda(C_1, C_2) \, P_{ml}(E \mid C_1, C_2) + (1 - \lambda(C_1, C_2)) \, P(E \mid C_1)$
$\lambda(x) = \mathrm{events}(x) / (\mathrm{events}(x) + K \cdot \mathrm{outcomes}(x))$
Lexical model: smooth to uniform vocab estimate
Tag model: tag given tag for well-formedness
Category-based smoothing of unknown tokens
Assign lexical tokens to categories
Capitalized, all-caps, alpha-numeric, number+period, etc.
Replace unknown words with categories
Result is not a joint model of $P(Ws, Ts)$
OK for maximizing $P(Ts \mid Ws)$
No category-based smoothing of known tokens in history
18
Blending Dictionaries/Gazetteers

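Lexical and tag models
Given John:PERSON
$P(\text{John} \mid \text{ST\_PERSON})$
Given John Smith:PERSON
$P(\text{Smith} \mid \text{PERSON}, \text{ST\_PERSON}, \text{John})$
$P(\text{PERSON} \mid \text{ST\_PERSON}, \text{John})$
Given John Smith Junior:PERSON
$P(\text{Junior} \mid \text{PERSON}, \text{PERSON}, \text{Smith}, \text{John})$
$P(\text{PERSON} \mid \text{PERSON}, \text{Smith}, \text{John})$
Easier with pure language-model based system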

19
Multi-lingual & Multi-genre Models

Based on language segmentation for SpeechWorks
Trained models for Hindi & English
TIDES Surprise Language 2003
Ported our ThreatTracker interface
About ½-1% f-score hit for using multilingual model
Models don't interfere much
$P(w_n \mid t_n, t_{n-1}, w_{n-1})$
Until smoothing to $P(w_n \mid t_n)$, only use Hindi context for Hindi following if $t_n, w_{n-1}$ is known.
$P(t_n \mid t_{n-1}, w_{n-1}, w_{n-2})$
Until smoothing to $P(t_n \mid t_{n-1})$
Would probably help to model transitions on multi-lingual data and expected quantity of each if not uniform
As is, we just trained with all the data we had (400K toks/language)
Not nearly as bad as HMMs for pronunciation variation

20
Named Entity Algorithms

See Dan Gusfield's book Algorithms on Strings, Trees, and Sequences
Must read for non-statistical string algorithms
Also great intro to suffix trees and computational biology
Theoretically linear in input text size × tag set size
Beam greatly reduces dependence on tagging
Smoothing ST_TAG and TAG reduces contexts by half
Dictionary-based tagging
Aho-Corasick algorithm is linear asymptotically
Trie with suffix-to-prefix matching
Actually more efficient to just hash prefixes for short strings
Statistical model decoding
Simple dynamic programming (often called "Viterbi")
Only keep best analysis for outcome given history
Outcomes are tags, and only bigram tag history
Lattice slicing for constant memory allocation (vs. full lattice)
Allocate a pair of arrays sized by tags and re-use per token
Still need backpointers, but in practice, very deterministic
Rely on Java's garbage collection
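
The "pair of arrays re-used per token" idea can be sketched with a plain first-order Viterbi decoder. This is a simplified stand-in, not the LingPipe decoder: it assumes a basic HMM with log-probability tables start[t], trans[s][t] and emit[t][w] rather than the word-conditioned model above, has no beam, and keeps a full backpointer table.

```java
final class ViterbiDecoder {
    /**
     * Returns the most likely tag index for each word.
     * start[t], trans[s][t] and emit[t][w] are log probabilities.
     */
    static int[] decode(int[] words, double[] start, double[][] trans, double[][] emit) {
        int numTags = start.length;
        if (words.length == 0) return new int[0];

        // Lattice slice: only two score columns exist, swapped after each token.
        double[] prev = new double[numTags];
        double[] cur = new double[numTags];
        int[][] back = new int[words.length][numTags]; // backpointers are still needed

        for (int t = 0; t < numTags; ++t) prev[t] = start[t] + emit[t][words[0]];

        for (int i = 1; i < words.length; ++i) {
            for (int t = 0; t < numTags; ++t) {
                int best = 0;
                for (int s = 1; s < numTags; ++s) {
                    if (prev[s] + trans[s][t] > prev[best] + trans[best][t]) best = s;
                }
                cur[t] = prev[best] + trans[best][t] + emit[t][words[i]];
                back[i][t] = best;
            }
            double[] tmp = prev; // re-use the old column for the next token
            prev = cur;
            cur = tmp;
        }

        int last = 0;
        for (int t = 1; t < numTags; ++t) {
            if (prev[t] > prev[last]) last = t;
        }
        int[] tags = new int[words.length];
        tags[words.length - 1] = last;
        for (int i = words.length - 1; i > 0; --i) tags[i - 1] = back[i][tags[i]];
        return tags;
    }
}
```

The score storage is constant in the input length; only the integer backpointers grow with the number of tokens.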

21
So why's it so slow?

Limiting factor is memory-to-CPU bandwidth
aka frontside bus (FSB)
Determined by chipset, motherboard & memory
Best Pentium FSB: 800MHz (vs. 3.2GHz CPU)
Best Xeon FSB: 533MHz
Models are 2-15 MB, even pruned & packed
CPU L2 cache sizes are 512K to 1MB
Thus, most model lookups are cache misses
Same issue as database paging, only closer to the CPU

22
Packing Models into Memory

Based on SpeechWorks Language ID work
Had to run on a handheld with multiple models
Prune low counts
Better to do relative-entropy-based pruning
Eliminate estimate counts that are similar to smoothed estimates
Symbol tables for tokens & 32-bit floating point
At SPWX, mapped floats to 16-bit integers
Trie structure from general to specific contexts
Only walk down until context is found ($\lambda \neq 0.0$)
$P(w_n \mid t_n, t_{n-1}, w_{n-1})$
Contexts: → $t_n$ → $t_{n-1}$ → $w_{n-1}$, each storing $\log(1 - \lambda(\text{context}))$
Outcomes: → $w_n$ → $w_n$ → $w_n$, each storing $\log P(w_n \mid \text{context})$
Array-based with binary search
Binary search is very hard on memory with large arrays
Better to hash low-order contexts, OK for smaller contexts
I'm going to need the board for this one
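
Mapping floats to 16-bit integers could be done in several ways; the slide does not say which was used. The sketch below assumes simple linear quantization of log probabilities over a fixed range, with out-of-range values clamped. LogProbQuantizer is a hypothetical name, and char is used because it is Java's only unsigned 16-bit type.

```java
final class LogProbQuantizer {
    private static final int LEVELS = 65535; // largest unsigned 16-bit code
    private final double min;
    private final double max;
    private final double step;

    /** Covers log probabilities in [min, max] with evenly spaced codes. */
    LogProbQuantizer(double min, double max) {
        this.min = min;
        this.max = max;
        this.step = (max - min) / LEVELS;
    }

    /** Clamps to the covered range and rounds to the nearest 16-bit code. */
    char encode(double logProb) {
        double clamped = Math.max(min, Math.min(max, logProb));
        return (char) Math.round((clamped - min) / step);
    }

    /** Recovers the representative value for a code. */
    double decode(char code) {
        return min + code * step;
    }

    /** Spacing between adjacent codes; the round-trip error is at most half of this. */
    double step() {
        return step;
    }
}
```

Storing one char per estimate instead of a 32-bit float halves that part of the model, at the cost of a bounded rounding error in each log probability.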

23
Named Entity Models and Accuracy

Spanish News (CoNLL): P = .95, R = .96, F = .95
English News (MUC7): P = .95, R = .92, F = .93
Hindi News (TIDES SL): P = .89, R = .84, F = .86
English Genomics (GENIA): P = .79, R = .79, F = .79
Dutch News (CoNLL): P = .90, R = .68, F = .77
All tested without gazetteers
All-caps models only 5-10% less accurate
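
The F values above are consistent with the balanced F-measure, the harmonic mean of precision and recall. Assuming that is the measure used, it can be computed as in this sketch (FScore is a hypothetical helper name):

```java
final class FScore {
    /** Balanced F-measure: the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0; // avoid division by zero
        return 2.0 * precision * recall / (precision + recall);
    }
}
```

The harmonic mean is pulled toward the lower of the two inputs, which is why the Dutch result, with high precision but low recall, has an F-score well below the average of P and R.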

24
Within-Document Coreference

Mentions merged into mention chains
Greedy left-to-right algorithm over mentions
Computes match of mention vs. all previous mention chains
No match creates a new mention chain
Ties cause a new mention chain (or can cause tighter match)
Matching functions determined by entity type (PERSON, ORGANIZATION, etc.)
Generic matching functions for token-sensitive edit distance
Next step is Soundex-style spelling variation
Specialized matching for pronouns and gender
Matching functions may depend on user-defined entities providing thesaurus-like expansion ("Joe DiMaggio" and "Joltin' Joe" or "the Yankee Clipper")
User-configurable matching based on entity type (e.g. PROTEIN)
Next step is to add contextual information
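
The greedy left-to-right control flow can be sketched as below. Only the loop structure follows the slide (best match joins a chain, no match or a tie starts a new chain); the matching function here is a deliberately crude placeholder that counts shared lowercase tokens, not the type-specific matchers the slide describes. GreedyCoref is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class GreedyCoref {
    private static Set<String> tokens(String mention) {
        return new HashSet<>(Arrays.asList(mention.toLowerCase(Locale.ROOT).split("\\s+")));
    }

    /** Placeholder matcher: the most tokens shared with any mention in the chain. */
    static int score(String mention, List<String> chain) {
        int best = 0;
        for (String previous : chain) {
            Set<String> shared = tokens(mention);
            shared.retainAll(tokens(previous));
            best = Math.max(best, shared.size());
        }
        return best;
    }

    /** Assigns each mention, in order, to the best previous chain or to a new one. */
    static List<List<String>> resolve(List<String> mentions) {
        List<List<String>> chains = new ArrayList<>();
        for (String mention : mentions) {
            List<String> bestChain = null;
            int bestScore = 0;
            boolean tie = false;
            for (List<String> chain : chains) {
                int s = score(mention, chain);
                if (s > bestScore) {
                    bestScore = s;
                    bestChain = chain;
                    tie = false;
                } else if (s == bestScore && s > 0) {
                    tie = true; // two chains match equally well
                }
            }
            if (bestChain == null || tie) {
                chains.add(new ArrayList<>(List.of(mention))); // no match or tie: new chain
            } else {
                bestChain.add(mention);
            }
        }
        return chains;
    }
}
```

Because each mention is compared only against chains built so far, the result depends on mention order, which suits a single left-to-right pass over one document.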

25
Cross-Document Coreference

Mention chains merged into entities
Greedy order-independent algorithm over mention chains
Matching functions involve complex reasoning over sets of mentions in the chain versus sets of mentions in candidate entities.
Matching involves properties of the mentions in the whole database and degree of overlap
"Joe" or "Bush" show up in too many entities to be good distinguishing matchers
Chain: John Smith, Mr. Smith, Smith
Entity 1: John Smith Jr., John Smith, John, Smith
Entity 2: John Smith Sr., John Smith, Jack Smith, Senior
Chain: John James Smith, John Smith
Entity: John Smith, Smith, John K. Smith
Only pipeline component that must run synchronously.
Only takes 5% of pipeline processing time.
Next step (recreating Bagga/Baldwin): contextual information