Title: Domain Adaptation in Natural Language Processing
1 Domain Adaptation in Natural Language Processing
- Jing Jiang
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2 Textual Data in the Information Age
- Contains much useful information
  - E.g., >85% of corporate data is stored as text
- Hard to handle
  - Large amount: by 2002, 2.5 billion documents on the surface Web, growing by 7.3 million per day
  - Diversity: emails, news, digital libraries, Web logs, etc.
  - Unstructured, vs. relational databases
How to manage textual data?
3
- Information retrieval: ranks documents based on relevance to keyword queries
- Not always satisfactory
- More sophisticated services desired
4 Automatic Text Summarization
5 Question Answering
6 Information Extraction
7 Beyond Information Retrieval
- Automatic text summarization
- Question answering
- Information extraction
- Sentiment analysis
- Machine translation
- Etc.
All rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text.
8 Typical NLP Tasks
- Example sentence: Larry Page was Google's founding CEO
- Part-of-speech tagging
  - Larry/noun Page/noun was/verb Google/noun 's/possessive-ending founding/adjective CEO/noun
- Chunking
  - [NP Larry Page] [V was] [NP Google's founding CEO]
- Named entity recognition
  - [person Larry Page] was [organization Google]'s founding CEO
- Relation extraction
  - Founder(Larry Page, Google)
- Word sense disambiguation
  - Larry Page vs. page 81
State-of-the-art solution: supervised machine learning
9 Supervised Learning for NLP
- Representative corpus: WSJ articles
- Human annotation → POS-tagged WSJ articles
  - Larry/NNP Page/NNP was/VBD Google/NNP 's/POS founding/ADJ CEO/NN
- Standard supervised learning algorithm (training) → trained POS tagger
- Target task: part-of-speech tagging on news articles
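The pipeline above (representative corpus → human annotation → training → trained tagger) can be sketched with a deliberately tiny supervised baseline: a most-frequent-tag tagger learned from a toy annotated sample. The mini-corpus below is an illustrative stand-in for the POS-tagged WSJ data, not part of the talk.

```python
from collections import Counter, defaultdict

# Toy "annotated corpus" standing in for POS-tagged WSJ articles.
tagged = [("Larry", "NNP"), ("Page", "NNP"), ("was", "VBD"),
          ("Google", "NNP"), ("'s", "POS"), ("founding", "JJ"), ("CEO", "NN"),
          ("the", "DT"), ("company", "NN"), ("was", "VBD"), ("growing", "VBG")]

def train(corpus):
    """Learn the most frequent tag per word, plus a global fallback tag."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    fallback = Counter(t for _, t in corpus).most_common(1)[0][0]
    return model, fallback

def tag(model, fallback, words):
    """Tag each word with its most frequent training tag, else the fallback."""
    return [(w, model.get(w, fallback)) for w in words]

model, fallback = train(tagged)
print(tag(model, fallback, ["Google", "was", "growing"]))
# → [('Google', 'NNP'), ('was', 'VBD'), ('growing', 'VBG')]
```

Real taggers replace the lookup with a feature-based classifier, but the supervised recipe — annotated corpus in, trained model out — is the same.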
10 In Reality
- Human annotation is expensive (✗)
- Representative corpus: MEDLINE articles
- Desired: POS-tagged MEDLINE articles, e.g.
  - We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS
- Available: POS-tagged WSJ articles
- Standard supervised learning algorithm (training) → trained POS tagger
- Target task: part-of-speech tagging on biomedical articles
11 Many Other Examples
- Named entity recognition
  - News articles → personal blogs
  - Organism A → organism B
- Spam filtering
  - Public email collection → personal inboxes
- Sentiment analysis of product reviews (positive vs. negative)
  - Movies → books
  - Cell phones → digital cameras
What is the problem with this non-standard setting with domain difference?
12 Domain Difference → Performance Degradation
- Ideal setting: POS tagger trained on MEDLINE, tested on MEDLINE: 96
- Realistic setting: POS tagger trained on WSJ, tested on MEDLINE: 86
13 Another Example
- Ideal setting: gene name recognizer: 54.1
- Realistic setting: gene name recognizer: 28.1
14 Domain Adaptation
- Source domain: labeled
- Target domain: labeled and/or unlabeled
- Goal: to design learning algorithms that are aware of domain difference and exploit all available data to adapt to the target domain
- → Domain adaptive learning algorithm
15 With Domain Adaptation Techniques
- Standard learning: gene name recognizer trained on Yeast and Fly, tested on Mouse: 63.3
- Domain adaptive learning: gene name recognizer trained on Yeast and Fly, tested on Mouse: 75.9
16 Roadmap
- What is domain adaptation in NLP?
- Our work
- Overview
- Instance weighting
- Feature selection
- Summary and future work
17 Overview
(slides 17–23 illustrate the following settings with source-domain vs. target-domain diagrams)
18 Ideal Goal
19 Standard Supervised Learning
20 Standard Semi-Supervised Learning
21 Idea 1: Generalization
22 Idea 2: Adaptation
23 How to formally formulate the ideas?
24 Instance Weighting
- Instance space (each point represents an observed instance); source domain vs. target domain
- Goal: to find appropriate weights for different instances
25 Feature Selection
- Feature space (each point represents a useful feature); source domain vs. target domain
- Goal: to separate generalizable features from domain-specific features
26 Roadmap
- What is domain adaptation in NLP?
- Our work
- Overview
- Instance weighting
- Feature selection
- Summary and future work
27 Observation
(source domain vs. target domain diagram)
28 Observation
(source domain vs. target domain diagram)
29 Analysis of Domain Difference
- x: observed instance; y: class label (to be predicted)
- p(x, y) = p(x) p(y|x)
- Labeling difference: p_s(y|x) ≠ p_t(y|x) → labeling adaptation
- Instance difference: p_s(x) ≠ p_t(x) → instance adaptation
30–31 Labeling Adaptation
- Source domain vs. target domain
- Where p_t(y|x) ≠ p_s(y|x): remove/demote those source instances
32–33 Instance Adaptation (p_t(x) < p_s(x))
- Source domain vs. target domain
- Where p_t(x) < p_s(x): remove/demote those source instances
34–36 Instance Adaptation (p_t(x) > p_s(x))
- Source domain vs. target domain
- Where p_t(x) > p_s(x): promote those instances
- Target domain instances are useful
37 Empirical Risk Minimization with Three Sets of Instances
- Three sets of instances: D_s (labeled source), D_{t,l} (labeled target), D_{t,u} (unlabeled target)
- The optimal classification model minimizes the expected loss (under a loss function) on the target distribution
- Use empirical loss to replace expected loss
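Written out, the objective sketched on this slide takes the following shape (my reconstruction from the surrounding definitions; the exact notation of the original slides is not recoverable from this text):

```latex
% Optimal model: minimize expected loss under the target distribution p_t
\theta^* = \arg\min_{\theta} \sum_{y} \int_{\mathcal{X}} p_t(x, y)\,
           \mathrm{loss}(x, y, \theta)\, dx
% In practice, the expected loss is replaced by a weighted empirical loss
% over the three available instance sets D_s, D_{t,l}, D_{t,u}.
```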
38 Using D_s
- Empirical loss summed over x ∈ D_s
- Must account for the instance difference (hard for high-dimensional data)
- and for the labeling difference (needs labeled target data)
39 Using D_{t,l}
- Empirical loss summed over x ∈ D_{t,l}
- Small sample size → estimation not accurate
40 Using D_{t,u}
- Empirical loss summed over x ∈ D_{t,u}
- Use predicted labels (bootstrapping)
41 Combined Framework
- A flexible setup covering both standard methods and new domain adaptive methods
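The combined framework can be sketched as one weighted empirical risk over the three instance sets. Everything numeric below — the data, the per-set weights (the `alphas`), and the per-instance source weights — is invented for illustration; the talk's heuristics for setting these weights come on the next slides.

```python
import numpy as np

def log_loss(w, X, y):
    """Per-instance logistic loss for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * (X @ w)))

def combined_risk(w, Ds, Dtl, Dtu_X, Dtu_ypred, alphas, inst_w_s):
    """Weighted empirical risk over source data (Ds), labeled target data
    (Dtl), and unlabeled target data with predicted labels (Dtu)."""
    Xs, ys = Ds
    Xt, yt = Dtl
    a_s, a_tl, a_tu = alphas
    risk = a_s * np.mean(inst_w_s * log_loss(w, Xs, ys))   # weighted source set
    risk += a_tl * np.mean(log_loss(w, Xt, yt))            # small labeled target set
    risk += a_tu * np.mean(log_loss(w, Dtu_X, Dtu_ypred))  # bootstrapped target set
    return risk

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(20, 3)), rng.choice([-1, 1], 20)
Xt, yt = rng.normal(size=(4, 3)), rng.choice([-1, 1], 4)
Xu, yu = rng.normal(size=(10, 3)), rng.choice([-1, 1], 10)  # yu: predicted labels
w = np.zeros(3)
r = combined_risk(w, (Xs, ys), (Xt, yt), Xu, yu,
                  alphas=(1.0, 3.0, 1.0), inst_w_s=np.ones(20))
print(round(r, 4))  # with w = 0 every per-instance loss is log(2): 5*log(2) ≈ 3.4657
```

Setting individual `alphas` or `inst_w_s` entries to zero recovers the standard special cases (supervised learning on D_s alone, bootstrapping, etc.).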
42 Experiments
- NLP tasks
  - POS tagging: WSJ (Penn TreeBank) → Oncology (biomedical) text (Penn BioIE)
  - NE type classification: newswire → conversational telephone speech (CTS) and Web logs (WL) (ACE 2005)
  - Spam filtering: public email collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
- Three heuristics to partially explore the parameter settings
43 Instance Pruning: removing misleading instances from D_s
(result charts for POS, NE Type, and Spam)
44 D_{t,l} with Larger Weights
- D_{t,l} is very useful; promoting D_{t,l} is even more useful
(result charts for POS, NE Type, and Spam)
45 Bootstrapping with Larger Weights (until D_s and D_{t,u} are balanced)
- Promoting target instances is useful, even with predicted labels
(result charts for POS, NE Type, and Spam)
46 Roadmap
- What is domain adaptation in NLP?
- Our work
- Overview
- Instance weighting
- Feature selection
- Summary and future work
47–48 Observation 1: Domain-Specific Features
- wingless, daughterless, eyeless, apexless
  - describe phenotype in fly gene nomenclature
  - the suffix feature "-less" is useful for this organism
- CD38, PABPC5
  - Is the feature still useful for other organisms? No!
49–50 Observation 2: Generalizable Features
- Feature "X be expressed"
51 Assume Multiple Source Domains
- Source domains: labeled; target domain: unlabeled
- → Domain adaptive learning algorithm
52 Detour: Logistic Regression Classifiers
- p binary features, e.g. "-less" and "X be expressed" (as in "... and wingless are expressed in ...")
- Instance x = (0, 1, 0, 0, 1, 0, 1, 0)
- Class weight vector w_y = (0.2, 4.5, 5, -0.3, 3.0, 2.1, -0.9, 0.4)
- Classification score: w_y^T x
53 Learning a Logistic Regression Classifier
- Objective: log likelihood of the training data (via the scores w_y^T x)
- minus a regularization term: penalize large weights, control model complexity
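The objective on this slide — training-data log likelihood minus a term penalizing large weights — can be sketched as a small gradient-ascent loop. This is the binary case, and the toy data below is invented for illustration.

```python
import numpy as np

def train_logreg(X, y, lam=0.1, lr=0.5, steps=500):
    """Maximize log-likelihood minus (lam/2)*||w||^2 for binary labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # P(y = 1 | x) under the current weights
        grad = X.T @ (y - p) - lam * w      # gradient of the regularized log likelihood
        w += lr * grad / len(y)             # ascent step
    return w

# Toy data: feature 0 predicts the label, feature 1 is noise.
X = np.array([[1.0, 0.3], [0.9, -0.2], [-1.0, 0.1], [-0.8, -0.4]])
y = np.array([1, 1, 0, 0])
w = train_logreg(X, y)
print(w[0] > 0)  # the predictive feature receives a positive weight
```

Larger `lam` shrinks all weights toward zero; the feature-selection frameworks on the following slides work by applying different regularization strengths to different groups of weights.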
54 Generalizable Features in Weight Vectors
- K source domains D_1, D_2, ..., D_K with learned weight vectors:
  - w_1 = (0.2, 4.5, 5, -0.3, 3.0, 2.1, -0.9, 0.4)
  - w_2 = (3.2, 0.5, 4.5, -0.1, 3.5, 0.1, -1.0, -0.2)
  - w_K = (0.1, 0.7, 4.2, 0.1, 3.2, 1.7, 0.1, 0.3)
- Features weighted consistently highly across domains are generalizable; the rest are domain-specific
55 Decomposition of w_k for Each Source Domain
- w_k = A^T v + u_k
  - v: weights of the generalizable features, shared by all domains
  - u_k: domain-specific weights
  - A: a matrix that selects generalizable features
(numeric example: w_k = (0.2, 4.5, 5, -0.3, 3.0, 2.1, -0.9, 0.4) splits into a shared part on the generalizable coordinates and a domain-specific remainder)
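A toy illustration of the decomposition: A selects which coordinates are generalizable, v holds their shared weights, and u_k holds the domain-specific remainder. The dimensions and numbers below are invented; in the actual framework A, v, and the u_k are learned from data.

```python
import numpy as np

p, g = 5, 2                     # p features, g of them generalizable
A = np.zeros((g, p))            # selection matrix: one row per generalizable feature
A[0, 2] = 1.0                   # feature 2 is generalizable
A[1, 4] = 1.0                   # feature 4 is generalizable
v = np.array([4.6, 3.2])        # shared weights for the selected features
u_k = np.array([0.2, 4.5, 0.0, -0.3, 0.0])  # domain-specific part (zero on shared dims)

w_k = A.T @ v + u_k             # the decomposition w_k = A^T v + u_k
print(w_k)                      # → [ 0.2  4.5  4.6 -0.3  3.2]
```

Because A^T v only touches the selected coordinates, regularizing the u_k strongly (λ_s ≫ 1, slide 56) forces those coordinates to be explained by the shared v.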
56 Framework for Generalization
- Fix A, optimize w_k
- Objective: log likelihood of labeled data from the K source domains, minus a regularization term
- λ_s ≫ 1 to penalize domain-specific features
57 Framework for Adaptation
- Fix A, optimize
- Objective: log likelihood of target domain examples with predicted labels
- λ_t ≪ λ_s (λ_t ≈ 1) to pick up domain-specific features in the target domain
58 How to Find A? (1)
59 How to Find A? (2)
- Domain cross validation
  - Idea: train on (K − 1) source domains and validate on the held-out source domain
- Approximation
  - w_f^k: weight for feature f learned from domain k
  - w̄_f^k: weight for feature f learned from the other domains
  - Rank features by the agreement between w_f^k and w̄_f^k
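The approximation can be sketched as follows: for every feature, compare the weight learned inside each held-out domain with the weight learned from the other domains, and score by their product (the scoring shown on the next slide). The per-domain weights and the exact averaging used here are invented for illustration.

```python
# weights[k][f]: weight of feature f learned from domain k alone (invented numbers)
weights = {
    "yeast": {"expressed": 1.4, "-less": 0.02},
    "mouse": {"expressed": 1.6, "-less": 0.08},
    "fly":   {"expressed": 1.2, "-less": 2.0},
}

def rank_by_domain_cv(weights):
    """Score each feature by averaging, over held-out domains k, the product of
    its weight inside k and its mean weight in the remaining domains."""
    feats = next(iter(weights.values())).keys()
    scores = {}
    for f in feats:
        prods = []
        for k in weights:
            inside = weights[k][f]
            others = [weights[j][f] for j in weights if j != k]
            prods.append(inside * sum(others) / len(others))
        scores[f] = sum(prods) / len(prods)
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_domain_cv(weights))  # "expressed" ranks above "-less"
```

Features like "-less" that carry weight in only one domain score low, so they are excluded from the generalizable set that A selects.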
60 Intuition for Domain Cross Validation
- Domains D_1, D_2, ..., D_{k-1} held in; D_k (fly) held out
- w_1 (learned from D_1 ... D_{k-1}): expressed = 1.5, -less = 0.05
- w_2 (learned from D_k): -less = 2.0, expressed = 1.2
- Product of w_1 and w_2: expressed = 1.8, -less = 0.1
- "expressed" generalizes across domains; "-less" does not
61 Experiments
- Data set
- BioCreative Challenge Task 1B
- Gene/protein recognition
- 3 organisms/domains: fly, mouse, and yeast
- Experimental setup
- 2 organisms for training, 1 for testing
- F1 as performance measure
62 Experiments: Generalization
- F = fly, M = mouse, Y = yeast
- Using generalizable features is effective
- Domain cross validation is more effective than joint optimization
63 Experiments: Adaptation
- F = fly, M = mouse, Y = yeast
- Domain-adaptive bootstrapping is more effective than regular bootstrapping
64 Related Work
- The problem is relatively new to the NLP and ML communities
- Most related work was developed concurrently with ours
65 Roadmap
- What is domain adaptation in NLP?
- Our work
- Overview
- Instance weighting
- Feature selection
- Summary and future work
66 Summary
- Domain adaptation is a critical novel problem in natural language processing and machine learning
- Contributions
  - First systematic formal analysis of domain adaptation
  - Two novel general frameworks, both shown to be effective
  - Potentially applicable to classification problems outside of NLP
- Future work
  - A measure of domain difference
  - Unifying the two frameworks
  - Incorporating domain knowledge into the adaptation process
  - Leveraging domain adaptation to perform large-scale information extraction on scientific literature and on the Web
67 Information Extraction System
- Core components: entity recognition and relation extraction
- Intelligent learning
  - Domain adaptive learning (labeled data from related domains)
  - Knowledge resources exploitation (existing knowledge bases)
  - Interactive expert supervision (domain expert)
68 Applications
- Input: biomedical literature (MEDLINE abstracts, full-text articles, etc.)
- Information extraction system
  - Entity recognition: "DWnt-2 is expressed in somatic cells of the gonad throughout development."
  - Relation extraction: expression relations
- Extracted facts feed an inference engine for hypothesis generation
- Applications: pathway construction, knowledge base curation
69 Applications (cont.)
- Similar ideas apply to Web text mining
- Product reviews
  - Existing annotated reviews are limited (certain products from certain sources)
  - Large amounts of semi-structured reviews on review websites
  - Unstructured reviews in personal blogs
70 Selected Publications
- This talk:
  - J. Jiang & C. Zhai. A two-stage approach to domain adaptation for statistical classifiers. In CIKM'07.
  - J. Jiang & C. Zhai. Instance weighting for domain adaptation in NLP. In ACL'07.
  - J. Jiang & C. Zhai. Exploiting domain structure for named entity recognition. In HLT-NAACL'06.
- Feature exploration for relation extraction:
  - J. Jiang & C. Zhai. A systematic exploration of the feature space for relation extraction. In NAACL-HLT'07.
- Information retrieval:
  - J. Jiang & C. Zhai. Extraction of coherent relevant passages using hidden Markov models. ACM Transactions on Information Systems (TOIS), Jul 2006.
  - J. Jiang & C. Zhai. An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval, Oct 2007.
- Gene summarization:
  - X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. Generating semi-structured gene summaries from biomedical literature. Information Processing & Management, Nov 2007.
  - X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. Automatically generating gene summaries from biomedical literature. In PSB'06.