Title: Applications of Statistical Natural Language Processing
1 Applications of Statistical Natural Language Processing
- Shih-Hung Wu
- Dept. of CSIE,
- Chaoyang University of Technology
2 My Research Topics
- Information Retrieval
- LSI text classification, Ontological Engineering, Question Answering
- Multi-Agent Coordination
- Game-theoretic multi-agent negotiation
- NLP
- Shallow parsing, grammar debugging, Named Entity Recognition
- Learning Technology
- Word problems in primary school mathematics
- Web Intelligence
- Wikipedia wrapper, cross-language IR
3 Outline
- AI and NLP
- Statistical Natural Language Processing
- Example 1: Preposition Usage by Language Model
- Sequential tagging technique
- Maximum Entropy
- Example 2: Biomedical NER
- Example 3: Chinese Shallow Parsing
4 What we expect from AI
- Robots that can understand human language and interact with humans
- C-3PO and R2-D2
5 What we have now
- Home robots that can dance or clean dust
6 What is missing?
- They cannot use natural language well
7 Research Topics in NLP/IR
- Applications of NLP/IR
- Question answering systems, input systems, information extraction, ontology extraction, grammar checkers
- Other tough NLP problems
- Machine translation, summarization, natural language generation
- Subgoals in NLP
- Word segmentation, POS tagging, full parsing, alignment, suffix patterns, shallow parsing, semantic role labeling
8 Natural Language Processing (NLP)
- Categories of development in NLP
- Corpus-based methods
- Statistical methods
- Textbooks on NLP
- Allen (1995) Natural Language Understanding
- Manning (1999) Foundations of Statistical NLP
- Jurafsky (2000) Speech and Language Processing
- Jackson (2002) NLP for Online Applications
9 Methodology
- Machine learning, pattern recognition, and statistical NLP share the same methodology: training and testing
- Train on a large set of training examples
- Test on an independent set of examples
10 Preposition Usage by Language Model
- Based on a conference paper
- Shih-Hung Wu, Chen-Yu Su, Tian-Jian Jiang, Wen-Lian Hsu, An Evaluation of Adopting Language Model as the Checker of Preposition Usage, Proceedings of ROCLING 2006.
11 Motivation
- Microsoft Word detects grammar errors but does not deal with preposition usage
- A language model can predict the next word
- Original idea: use a language model to predict the right preposition
- Current approach: calculate the probability of each candidate sentence
12 Language Model
- An LM uses a short history to predict the next word
- Ex. Sue swallowed the large green ___.
- ___ could be "pill" or "frog"
- large green ___
- ___ could be "pill" or "frog"
- Markov assumption
- Only the prior local context affects the next word
13 Sentence Probability
- The probability of a sentence in a language, say English, is defined as the probability of the word sequence
- Decomposed by the chain rule of conditional probability under the Markov assumption
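The equation did not survive the slide export; a standard reconstruction of the decomposition, assuming a bi-gram (first-order Markov) model, is:

  P(w_1 w_2 ... w_n) = \prod_{i=1}^{n} P(w_i | w_1, ..., w_{i-1}) \approx \prod_{i=1}^{n} P(w_i | w_{i-1})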
14 Maximum Likelihood Estimation
Maximum Likelihood Estimation: bi-gram
Maximum Likelihood Estimation: n-gram
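The estimation formulas are not reproduced in this export; the standard maximum likelihood estimates are:

  P_MLE(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})                                        (bi-gram)
  P_MLE(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1} ... w_i) / C(w_{i-n+1} ... w_{i-1})    (n-gram)

where C(.) is the count of the given word sequence in the training corpus.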
15 Smoothing 1: GT
- Good-Turing Discounting (GT)
- Adjusts the count of an n-gram from r to r*, based on the assumption that the distribution is binomial (Good, 1953)
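The adjusted count is not shown in this export; the usual Good-Turing estimate is:

  r* = (r + 1) n_{r+1} / n_r

where n_r is the number of distinct n-grams that occur exactly r times in the training data.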
16 Smoothing 2: AD
- Absolute Discounting (AD)
- All the unseen events gain frequency uniformly (Ney et al., 1994)
17 Smoothing 3: mKN
- Modified Kneser-Ney discounting (mKN)
- mKN has three discount parameters, D1, D2, and D3+, applied to n-grams with one, two, and three or more counts (Chen and Goodman, 1998)
18 Entropy and Perplexity
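The slide body is missing from this export; the standard definitions used to evaluate language models are:

  H = -(1/N) \sum_{i=1}^{N} \log_2 P(w_i | history)      (cross-entropy per word)
  PP = 2^H = P(w_1 ... w_N)^{-1/N}                       (perplexity)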
19 Experiment setting: training
- Training sets for bi-gram model
- Training sets for tri-gram model
20 Closed test setting
- Select 100 sentences from the training corpus
- Replace the correct preposition with other prepositions
- Ex.
- My sister whispered in my ear.
- My sister whispered on my ear.
- My sister whispered at my ear.
- The sentence with the lowest perplexity is assumed to be the correct one (see the sketch below).
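As a concrete illustration of this procedure, here is a minimal sketch that scores each candidate preposition with a bigram language model and picks the lowest-perplexity sentence. It uses add-one smoothing for brevity, whereas the paper evaluates GT, AD, and mKN smoothing; all function names are illustrative and not taken from the paper.

import math
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Count unigrams and bigrams from tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Bigram perplexity with add-one smoothing (a stand-in for the paper's smoothing methods)."""
    padded = ["<s>"] + tokens + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log2(p)
    n = len(padded) - 1
    return 2 ** (-log_prob / n)

def best_preposition(template, candidates, unigrams, bigrams):
    """Return the preposition whose filled-in sentence has the lowest perplexity."""
    def fill(prep):
        return [prep if tok == "__" else tok for tok in template]
    return min(candidates, key=lambda p: perplexity(fill(p), unigrams, bigrams))

# Hypothetical usage: flag a sentence when its preposition is not the lowest-perplexity choice.
# uni, bi = train_bigram_lm(corpus)
# best_preposition(["my", "sister", "whispered", "__", "my", "ear"], ["in", "to", "with", "on"], uni, bi)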
21 Closed test results
22 Open test
- 100 sentences collected from:
- 1. Amusements in Mathematics by Henry Ernest Dudeney
- 2. Grimm's Fairy Tales by Jacob Grimm and Wilhelm Grimm
- 3. The Art of War by Sun-Zi
- 4. The Best American Humorous Short Stories
- 5. The War of the Worlds by H. G. Wells
- Then use the same replacement setting as in the closed test
23 Open test results
24 TOEFL test
- 100 test questions like this one:
- My sister whispered __ my ear. (a) in (b) to (c) with (d) on
- Calculate the probability of the sentences
- My sister whispered in my ear. (correct)
- My sister whispered to my ear. (wrong)
- My sister whispered with my ear. (wrong)
- My sister whispered on my ear. (wrong)
25 TOEFL test results
26 Error case 1
27 Error case 2
28 Error case 3
29 Discussion
- The experimental results show that the accuracy of the open test is 71% and the accuracy of the closed test is 89%. The accuracy is 70% on TOEFL-level tests.
- Uses only an untagged corpus
30 Future work
- Encode more features into the statistical model
- Rule: use "in" (not "at") before the names of countries, regions, cities, and large towns
- NER is necessary
- Use more advanced statistical models
- Maximum Entropy (ME) (Berger et al., 1996)
- Conditional Random Fields (CRF) (Lafferty et al., 2001)
31 Sequential Tagging with Maximum Entropy
32 Sequence Tagging
- Given a sequence X = x1, x2, x3, ..., xn
- and a tag set T = t1, t2, t3, ..., tm
- Find the most probable valid sequence Y = y1, y2, y3, ..., yn
- where yi ∈ T (see the formula below)
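The objective that was on the slide is presumably the standard one:

  Y* = argmax_{Y = y_1 ... y_n, y_i ∈ T} P(Y | X)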
33 Sequence Tagging: Naïve Bayes
- Calculate P(yi | x1, x2, x3, ..., xn), ignoring features other than unigrams
- Naïve Bayes (a minimal sketch follows this slide):
- argmax P(yi | x1, x2, ..., xn), yi ∈ T
- = argmax P(yi) P(x1, x2, ..., xn | yi), yi ∈ T, by Bayes' theorem
- = argmax P(yi) P(x1 | yi) P(x2 | yi) ... P(xn | yi), yi ∈ T, by the independence assumption
- Pros: simple
- Cons: cannot handle overlapping features due to its independence assumption
34 Maximum Entropy
35 Applications
- NLP tasks
- Machine Translation (Berger et al., 1996)
- Sentence Boundary Detection
- Part-of-Speech Tagging (Ratnaparkhi, 1998)
- NER in the Newswire Domain (Borthwick, 1998)
- Adaptive Statistical Language Modeling (Rosenfeld, 1996)
- Chunking (Osborne, 2000)
- Junk Mail Filtering (Zhang, 2003)
- Biomedical NER (Lin et al., 2004)
36 Maximum Entropy: A Simple Example
- Suppose yi relates only to xi (unigram)
- The alphabet of X is denoted as S = {a, b, c}; the tag set is T = {t1, t2}
- We define a joint probability distribution p over S × T
- Given N observations (x1, y1), (x2, y2), ..., (xn, yn)
- such as (a, t1), (b, t2), (c, t1), ...
- How can we estimate the underlying model p?
37 The First Constraint (Cont.)
- Two possible distributions are:
- Intuitively, we think the right one is better, since it is more uniform than the left one
38 The Second Constraint (Cont.)
- Now we observe that t1 appears 70% of the time, so we add a new constraint to our model: P(a, t1) + P(b, t1) + P(c, t1) = 0.7
- Again, two possible distributions; the more uniform one is preferred
39 The Third Constraint (Cont.)
- Now we observe that "a" appears 50% of the time, so we add a new constraint to our model: P(a, t1) + P(a, t2) = 0.5
- Two questions:
- What does "uniform" mean?
- How can we find the most uniform model subject to a set of constraints?
40 Entropy
- A mathematical measure of uniformity (uncertainty)
- For a conditional distribution, its conditional entropy is
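The formula is missing from this export; the conditional entropy used in the ME literature (Berger et al., 1996) is:

  H(p) = - \sum_{x, y} \tilde{p}(x) p(y | x) \log p(y | x)

where \tilde{p}(x) is the empirical distribution of histories x.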
41 Maximum Entropy Solution
- Goal: select a model p* from the set C of allowed probability distributions which maximizes the entropy H(p)
- In other words, p* is the most uniform distribution we can have
- We call p* the maximum entropy solution
42 Expectation of Features
- Once a feature is defined, its expectation under the model is
- Its observed expectation is
- We require the two expectations to be equal
- More explicitly, we write
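The three formulas referred to on this slide are not reproduced in the export; in the usual notation they are:

  model expectation:     E_p[f]          = \sum_{x, y} \tilde{p}(x) p(y | x) f(x, y)
  observed expectation:  E_{\tilde{p}}[f] = \sum_{x, y} \tilde{p}(x, y) f(x, y)
  constraint:            E_p[f] = E_{\tilde{p}}[f]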
43 Represent Constraints by Features
- Under the ME framework, constraints imposed on a model are represented by features (known as feature functions) of the form
- For example, the previous constraints can be written as
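The feature-function form is missing from the export; a reconstruction consistent with slides 38-39 (our notation, not copied from the original) is:

  f_1(x, y) = 1 if y = t1, and 0 otherwise      (encodes "t1 appears 70% of the time")
  f_2(x, y) = 1 if x = a,  and 0 otherwise      (encodes "a appears 50% of the time")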
44 Expectation of Features (Cont.)
- In the previous example, the fact that t1 appears 70% of the time can be formalized as
- And the fact that "a" appears 50% of the time can be written as
45 Maximum Entropy Framework
- In general, suppose we have k constraints (features); we would like to find a model p* that lies in the subset C of P
- defined by the constraints
- which maximizes the entropy
46 Maximum Entropy Solution
- It can be proved that the maximum entropy solution p* must have the form
- where k is the number of features and Z(x) is a normalization factor to ensure that the probabilities sum to one
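The exponential form referred to here, standard in Berger et al. (1996), is:

  p(y | x) = (1 / Z(x)) exp( \sum_{i=1}^{k} \lambda_i f_i(x, y) ),  with
  Z(x) = \sum_{y} exp( \sum_{i=1}^{k} \lambda_i f_i(x, y) )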
47 Maximum Entropy Solution (Cont.)
- So our task is to estimate the parameters λi in p* which maximize H(p)
- In simple cases, we can find the solution analytically (as in the previous example); when the problem becomes more complex, we need a way to derive the λi automatically, given a set of constraints
48 Parameter Estimation
- A constrained optimization problem
- Finding a set of parameters Λ = λ1, λ2, λ3, ..., λn of an exponential model which maximizes its log-likelihood
- To find values for the parameters of p*:
- Generalized Iterative Scaling (Darroch and Ratcliff, 1972)
- Improved Iterative Scaling (Berger et al., 1996)
49 ME's Advantages and Disadvantages
- Advantages
- Knowledge-poor features
- Reusable software
- Free incorporation of overlapping and interdependent features
- Disadvantages
- Slow training procedure
- No explicit control of parameter variance (unlike SVMs)
50 Biomedical NER
- Based on an in-press journal paper
- Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, Integrating Linguistic Knowledge into a Conditional Random Field Framework to Identify Biomedical Named Entities, Expert Systems with Applications, Volume 30, Issue 1, January 2006, pp. 117-128. (SCI)
51 Sequence Tagging: NER Example
- Determine the best tags for the sentence "IL-2 gene induced NF-Kappa B"
- We can formulate this example as
- X = IL-2, gene, induced, NF-Kappa, B
- Assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
- > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
- > candidate 2 of Y: O, Ps, Pe, O, O (valid)
- ...
- > candidate k of Y: Ds, De, O, Ps, Pe (valid)
- The answer for this example is Y = Ds, De, O, Ps, Pe
52 Decoding
- By multiplying all P(yi | contexti), we can get the probability of a tag sequence Y
- But some tag sequences are invalid
- Ex: X = IL-2, gene, induced, NF-Kappa, B
- Assume T = {Ps, Pc, Pe, Pu, Ds, Dc, De, Du, O}
- > candidate 1 of Y: Ps, Ps, Ps, Ps, Ps (invalid)
- > candidate k of Y: Ds, De, O, Ps, Pe (valid)
- Even if P(candidate 1) > P(candidate k), the answer is still candidate k
- Use Viterbi search to find the most probable valid tag sequence (see the sketch below)
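A minimal Viterbi sketch for finding the most probable valid tag sequence. The log_prob and valid_transition callables are placeholders: the actual system would plug in ME (or CRF) probabilities and the Ps/Pc/Pe/Pu-style tag grammar; none of the names come from the paper.

def viterbi(tokens, tags, log_prob, valid_transition):
    """log_prob(i, tag)            -> log P(tag | context of token i)
    valid_transition(prev, tag)    -> False for impossible tag pairs (e.g. Ps followed by Ps)"""
    n = len(tokens)
    # best[i][tag] = (score of best valid path ending in `tag` at position i, back-pointer)
    best = [{} for _ in range(n)]
    for tag in tags:
        best[0][tag] = (log_prob(0, tag), None)
    for i in range(1, n):
        for tag in tags:
            candidates = [
                (score + log_prob(i, tag), prev)
                for prev, (score, _) in best[i - 1].items()
                if valid_transition(prev, tag)
            ]
            if candidates:
                best[i][tag] = max(candidates)
    # Trace back from the best final tag
    last_tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last_tag]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))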
53 Biomedical Named Entity Recognition
- Example sentence fragment: "... were used to model changes in susceptibility to NK cell killing caused by transient vs stable ..."
- We assign a tag to the current token xi according to its features in context.
- [Figure: candidate tags such as Protein_St, RNA_St, and DNA_St for the current token xi, with context features such as Feature 1: AllCaps]
54 Biomedical Named Entity Recognition
- BioNER: identify biomedical names in text and categorize them into different types
- Essentials
- Tagged corpus
- GENIA
- Features
- Orthographical features
- Morphological features
- Head noun features
- POS features
- Tag set
55 Biomedical NER: Tagged Corpus
- GENIA Corpus (Ohta et al., 2002)
- V1.1: 670 MEDLINE abstracts
- V2.1: 670 MEDLINE abstracts with POS tagging
- V3.0: 2000 MEDLINE abstracts
- V3.0p: 2000 MEDLINE abstracts with POS tagging
- V3.02
- POS tag set
- Penn Treebank (PTB)
56 Biomedical NER: Internal/External Features
- Internal
- Found within the name string itself
- e.g., "primary NK cells"
- External
- Context
- e.g., "activate ROI formation"
- e.g., "The CD4 coreceptor interacts ..." (AllCaps)
57 Orthographical Features
58 Morphological Features
59 Head Nouns
60 Biomedical NER: Tag Set
- 23 NE categories:
- Protein, Other Name, DNA, CellType, Other Organic, CellLine, Lipid, Multi Cell, Virus, RNA, CellComponent, Body Part, Tissue, AminoAcidMonomer, Polynucleotide, Mono Cell, Inorganic, Peptide, Nucleotide, Atom, Other Artificial, Carbohydrate, Organic
- Each NE category c has 4 tags: c_start, c_continue, c_end, c_unique
- Ex: Protein has protein_start, protein_continue, protein_end, protein_unique
- In addition, there is a non-NE tag "o"
- Therefore, |T| = 23 × 4 + 1 = 93
61 BioNER: System Architecture
62 Nested Named Entities
- Nested annotation
- <RNA><DNA>CIITA</DNA> mRNA</RNA>
- From the parsing perspective, we prefer bigger chunks, that is, <RNA>CIITA mRNA</RNA>; however, ME sometimes only recognizes CIITA as DNA
- 16.57% of NEs in GENIA 3.02 contain one or more shorter NEs (Zhang, 2003)
- Solution: post-processing
- Boundary extension
- Re-classification
63 Post-Processing: Boundary Extension
- Boundary extension for nested NEs
- Extend the right boundary repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag
- Extend the left boundary repeatedly if the NE is preceded by an L-boundary word with a valid POS tag
- Boundary extension for NEs containing brackets or slashes
- NE NE ( NE ) NE, or head noun, or R-boundary word with valid POS tag
- NE NE / NE ( / NE ) NE, or head noun, or R-boundary word with valid POS tag
64 Experimental Results: State-of-the-art Systems
- GENIA v3.02 (10-fold CV)
65 Remarks
- ME offers a clear way to incorporate various evidence into a single, powerful model
- It detects 80% of the rough positions of biomedical NEs
- Due to the nested annotations in GENIA and the preference for bigger chunks, we applied a post-processing technique and obtained the highest F-score on GENIA so far
66 References
- Borthwick, A. (1999). A Maximum Entropy Approach to Named Entity Recognition. New York University.
- Hou, W.-J. and H.-H. Chen (2003). Enhancing Performance of Protein Name Recognizers Using Collocation. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Kazama, J., T. Makino, et al. (2002). Tuning Support Vector Machines for Biomedical Named Entity Recognition. ACL Workshop on NLP in the Biomedical Domain.
- Lee, K.-J., Y.-S. Hwang, et al. (2003). Two-Phase Biomedical NE Recognition based on SVMs. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- McDonald, D. (1996). Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky (Eds.), Corpus Processing for Lexical Acquisition. Cambridge, MA: MIT Press, 21-39.
- Nenadic, G., S. Rice, et al. (2003). Selecting Text Features for Gene Name Classification: from Documents to Terms. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Ohta, T., Y. Tateisi, et al. (2002). The GENIA corpus: An annotated research abstract corpus in molecular biology domain. HLT 2002.
- Shen, D., J. Zhang, et al. (2003). Effective Adaptation of Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Takeuchi, K. and N. Collier (2003). Bio-Medical Entity Extraction using Support Vector Machines. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Torii, M., S. Kamboj, et al. (2003). An Investigation of Various Information Sources for Classifying Biological Names. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Tsuruoka, Y. and J. Tsujii (2003). Boosting Precision and Recall of Dictionary-Based Protein Name Recognition. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Yamamoto, K., T. Kudo, et al. (2003). Protein Name Tagging for Biomedical Annotation in Text. ACL Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan.
- Zhang, J., D. Shen, et al. (2003). Exploring Various Evidence for Recognition of Named Entities in Biomedical Domain. EMNLP 2003.
67 Chinese Shallow Parsing
- Based on a conference paper
- Shih-Hung Wu, Cheng-Wei Shih, Chia-Wei Wu, Tzong-Han Tsai, and Wen-Lian Hsu, Applying Maximum Entropy to Robust Chinese Shallow Parsing, Proceedings of ROCLING 2005, NCKU, Tainan.
68 Outline
- Introduction
- Method
- Sequential tagging
- Maximum Entropy
- Experiment
- Noise Generation
- Test on the Noisy Training Set
- Conclusion and future works
69 Introduction
- Full parsing is useful but difficult
- Ambiguity, unknown words
- Chunking is achievable
- Fast and robust (suitable for online applications)
- Shallow parsing (chunking) applications
- Information retrieval, information extraction, question answering, and automatic document summarization
- Our goal
- Build a Chinese shallow parser and test its robustness
70 Chunking with unknown words
- ???/??/?/??/??/???/??
- Standard chunking: ????????/NP ??/Dd ???/DM ??/VP
- If ?/?/? VH13/Dd/P15 is an unknown word, then the chunking might be
- ??/NP ??????/PP ??/Dd ???/DM ??/VP
71 Related work in Beijing, Harbin, Shenyang, and Hong Kong [10, 15, 16, 20, 21]
- Standard
- News corpora, UPenn Treebank, Sinica Treebank
- Method
- Memory-based learning, Naïve Bayes, SVM, CRF, ME
- Evaluation
- Perplexity, token accuracy, chunk accuracy
- Noise models
- Random noise, filled noise, and repeated noise [13]
72 Chunk standard
- First-level phrases of Sinica Treebank 3.0
73 Phrases
74 Sequential Tagging Scheme
75 Our Tag Set
- Each token is tagged with one of 11 tags:
- NP_begin, NP_continue,
- VP_begin, VP_continue,
- PP_begin, PP_continue,
- GP_begin, GP_continue,
- S_begin, S_continue,
- and X (others)
76 Maximum Entropy (2)
- Conditional Exponential Model
- Binary-valued feature functions
- Decoding
77 Feature functions
- Each feature is represented by a binary-valued
function
78 Conditional Exponential Model
- Feature fi(x, y) is a binary-valued function
- Parameter λi is a real-valued weight associated with fi
- Model Λ = λ1, λ2, ..., λn
- Normalizing factor Z(x)
79 Conditional Exponential Model
- The probability of observation o, given history h
- Feature fi(h, o) is a binary-valued function
- Parameter αi is a real-valued weight associated with fi
- Normalizing factor Z(h)
80 Using the Model
- Training
- Use empirical data to estimate the parameters with the Improved Iterative Scaling (IIS) algorithm
- Test
- Decoding: find the highest-probability path through the lattice of conditional probabilities
81 Experiment
82 Data and Features
- Sinica Treebank 3.0 contains more than 54,000 sentences, from which we randomly extract 30,000 for training and 10,000 for testing
- Features
- Words, adjacent characters, prefixes of words (1 and 2 characters), suffixes of words (1 and 2 characters), word length, POS of words, adjacent POS tags, and the word's location in the chunk it belongs to
83 Noise Model Generation
- Type 1 (single characters)
- ???? Nca replaced by
- ?, Nab, ?, Dbab, ?, Ncda, ?, Nca
- Type 2 (AUTOTAG segmentation)
- ???? Nb would be tagged as ??/Nb, ??/Nb.
84 Results and Discussion
85 Evaluation Criteria
- We define four types of accuracy:
- Chunk boundary accuracy
- Ignores the category
- Chunk category accuracy
- Ignores the boundary
- Token accuracy
- Chunk accuracy
86 Evaluation Criteria: Example
- Standard parsing
- ???/NP ??/VC ?/NP ?-???/VP
- 4 chunks, 5 tokens
- If the parsing result is
- ???/NP ??/VC ?/NP ?/Db ???/VE
- Then:
- Chunk boundary accuracy: 3/4 = 0.75
- Chunk category accuracy: 3/4 = 0.75
- Token accuracy: 3/5 = 0.6
- Chunk accuracy: 3/4 = 0.75
87 Results for Type 1 noisy data
- The percentage of Nb and Nc replaced by single-character noisy data
88 Evaluation of the boundaries in different experiment configurations
89 Evaluation of the chunking category in different experiment configurations
90 Evaluation of tokens in different experiment configurations
91 Evaluation of chunks in different experiment configurations
92 Results for Type 2 noisy data
- C-C: A clean training model and clean test data
- C-N: A clean training model and noisy test data in which all Nb and Nc are replaced by tokenized results
- N-C: A training model with noisy data, in which all Nb and Nc are replaced by tokenized results, chunking clean test data
- N-N: Both the training model and the test data contain noisy data in which all Nb and Nc are replaced by tokenized results
93 Tokenized-string noisy data (the Type 2 noise model) vs. the AUTOTAG-parsed model
94 Error Analysis
95 Chunking examples with Type 1 noise
96 Shallow parsing examples with Type 2 noise
97 Shallow parsing examples with AUTOTAG-parsed training data and test data
98 Chunk results on an open corpus
99 Conclusion and Future Work
- Our system can chunk Chinese sentences into five chunk types
- The accuracy on data with simulated unknown words decreases only slightly in chunk parsing
- Chunking an open corpus yields interesting results
- Future work
- Adopting other POS systems, such as the Penn Chinese Treebank tagset, for Chinese shallow parsing could prove both interesting and useful
- Adding more types of noise, such as the random noise, filled noise, and repeated noise proposed by Osborne [13]
- In addition to the Sinica Treebank, we will extend our training corpus by incorporating other corpora, such as the Penn Chinese Treebank
100 Appendix
- ME and Improved Iterative Scaling Algorithm
101 Model a Problem
102 Conditional Exponential Model
- Feature fi(x, y) is a binary-valued function
- Parameter λi is a real-valued weight associated with fi
- Model Λ = λ1, λ2, ..., λn
- Normalizing factor Z(x)
103 Notes on the Model
- Features are domain dependent
- The exponential form guarantees positivity
- Initially, the parameters λ1, λ2, ..., λn are unknown
- Use empirical data to estimate them
- Maximum log-likelihood
104 Maximum Log-likelihood
- Given a joint empirical distribution
- Log-likelihood is a measure of the quality of the model Λ
- Log-likelihood ≤ 0 always
- Log-likelihood = 0 is optimal
105 Maximum Likelihood of the Conditional Exponential Model
- Differentiating with respect to each λi
- The expectation of fi(x, y) with respect to the empirical distribution and the model
106 From Maximum Entropy to the Exponential Model
- Through Lagrange Multipliers
107 Entropy → Lagrangian
- H(p) = -Σx p(x) log p(x)
- Lagrangian
- Optimize the Lagrangian
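The Lagrangian itself is not reproduced in this export; in the standard derivation it is:

  \Lambda(p, \lambda) = H(p) + \sum_i \lambda_i ( E_p[f_i] - E_{\tilde{p}}[f_i] )

plus one more multiplier for the normalization constraint \sum_y p(y | x) = 1.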
108 (Cont.)
- For fixed λ, the Lagrangian has a maximum where
109 Training
- Improved Iterative Scaling (IIS) Algorithm
110 Finding the optimal λ by iteration
- The change in log-likelihood from Λ to Λ + Δ
111 Finding the optimal λ by iteration (2)
- Use the inequality −log α ≥ 1 − α
- where
- therefore
112 Finding the optimal λ by iteration (3)
- By Jensen's inequality (cf. appendix): exp(Σx p(x) q(x)) ≤ Σx p(x) exp(q(x))
- where
113 Finding the optimal λ by iteration (4)
- Call it B(Δ) and differentiate it
- Each δi appears alone, so we can solve for each δi separately
114 Improved Iterative Scaling Algorithm
- IIS Algorithm
- Start with some value for each λi
- Repeat until convergence:
- Find each δi by solving the equation (see below)
- Set λi ← λi + δi
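The equation solved for each δi is not reproduced in this export; the standard IIS update (Berger et al., 1996) is:

  \sum_{x, y} \tilde{p}(x) p_\lambda(y | x) f_i(x, y) exp( \delta_i f^{#}(x, y) ) = E_{\tilde{p}}[f_i]

where f^{#}(x, y) = \sum_i f_i(x, y) is the total number of active features for the pair (x, y).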
115 Applications
- All sequential labeling problems
- Natural language processing
- NER, POS tagging, Chunking
- Speech recognition
- Graphics
- Noise reduction
- Many others
116 References
- A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 1996.
- A. Berger, The Improved Iterative Scaling Algorithm: A Gentle Introduction, 1997.