Title: Building Classifiers with JavaNLP
1 Building Classifiers with JavaNLP
- Kristina Toutanova
- Feb 7, 2005
2 Overview
- Example problem
- Log-linear Type 1 classifier using binary features
- Log-linear Type 1 classifier using real-valued features
- Log-linear Type 2 classifier using Type2Datum
  - Binary and real-valued
  - More memory efficient ...
- More classifiers
  - Extensions for the Type 1 log-linear classifier
  - Naïve Bayes, SVM
  - Products of experts
3 Classify Package (and Maxent)
- Provides facilities for training classifiers
- The data points are viewed as single instances (not sequences)
- Most used classifier: log-linear classifier with binary features
- Other classifiers: SVM, Naïve Bayes
- Concentrate on log-linear classifiers
4 Example Problem
- Restricted part-of-speech tagging
- Classify words, without context, into noun or verb parts of speech: NN, NNS, VB, VBP, VBZ, VBD, VBN
- Features: identity of the word, plus prefixes and suffixes of up to a certain length
- Example: years NNS
  - features: WORD=years, PREFIX1=y, SUFFIX1=s
  - label: NNS
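The feature set described above can be sketched in plain Java. This is a stand-alone illustration, not JavaNLP code: the class name `WordFeatures`, the method `extract`, and the `NAME=value` string format are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class WordFeatures {
    // Builds the word-identity feature plus PREFIXk / SUFFIXk features
    // for k = 1..maxLen (capped at the word length).
    // The "NAME=value" format is an illustrative choice, not a JavaNLP convention.
    static List<String> extract(String word, int maxLen) {
        List<String> features = new ArrayList<>();
        features.add("WORD=" + word);
        int limit = Math.min(maxLen, word.length());
        for (int k = 1; k <= limit; k++) {
            features.add("PREFIX" + k + "=" + word.substring(0, k));
            features.add("SUFFIX" + k + "=" + word.substring(word.length() - k));
        }
        return features;
    }

    public static void main(String[] args) {
        // prints [WORD=years, PREFIX1=y, SUFFIX1=s]
        System.out.println(extract("years", 1));
    }
}
```

With `maxLen` set to 1 this reproduces the three features listed for "years"; a larger `maxLen` adds the longer prefixes and suffixes.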
5 Log-linear classifier with binary features
- View the features as binary valued
- Example: years NNS
  - features WORD=years, PREFIX1=y, SUFFIX1=s have value 1
  - all other features have value 0
- Define a method to make data instances
  - years NNS -> Datum
- Put the instances into a Dataset
- Create a log-linear classifier factory
- Get a classifier from the factory
6 Log-linear classifier with binary features: Making Datums and Datasets
- Datum is an interface that can return the features as a collection and is labeled
- Use BasicDatum, which implements Datum
  - Collection features = new ArrayList();
  - features.add("WORD=years"); features.add("PREFIX1=y"); features.add("SUFFIX1=s");
  - Datum d = new BasicDatum(features, "NNS");
- Dataset dataSet = new Dataset();
- dataSet.add(d);
- Useful methods in Dataset
  - dataSet.applyFeatureCountThreshold(int cutoff)
  - dataSet.summaryStatistics() dumps the number of features and datums
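To make the idea of a feature count cutoff concrete, here is a minimal stand-alone sketch. It assumes that a cutoff drops every feature that occurs fewer than `cutoff` times in the data. The class `FeatureThresholdSketch` and its method are hypothetical and do not show how Dataset implements this internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FeatureThresholdSketch {
    // Counts total occurrences of each feature across all datums, then
    // returns a copy of the data keeping only features seen >= cutoff times.
    // Assumption: this mirrors what a feature count threshold is meant to do.
    static List<List<String>> applyThreshold(List<List<String>> data, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> datum : data) {
            for (String f : datum) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        List<List<String>> pruned = new ArrayList<>();
        for (List<String> datum : data) {
            List<String> kept = new ArrayList<>();
            for (String f : datum) {
                if (counts.get(f) >= cutoff) {
                    kept.add(f);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }
}
```

Rare features such as the identity of a word seen once are the usual candidates for removal, since they add parameters without much evidence behind them.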
7 Log-linear classifier with binary features: Making a factory and training
- LinearClassifier trainClassifier(Dataset dataSet) {
  - LinearClassifierFactory lcFactory = new LinearClassifierFactory(new QNMinimizer(5), 1e-3, false, 1.0);
  - return (LinearClassifier) lcFactory.trainClassifier(dataSet);
- }
- LinearClassifierFactory has many constructors depending on options
  - Type of optimizer
    - default is QNMinimizer, shouldn't need to change it
  - Type of prior
    - default is quadratic, shouldn't need to change it
  - Memory for QNMinimizer, tolerance
  - Value of regularization standard deviation sigma
    - Could be worth exploring different values
- There are methods for fitting with cross-validation and a held-out dataset
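For orientation, a quadratic prior with standard deviation sigma is usually written as a Gaussian penalty on the weights. The objective below is a sketch under that assumption; the notation (w for the weights, x_i and y_i for the training instances and labels) is introduced here and is not taken from the slides.

```latex
\min_{w} \; -\sum_{i=1}^{n} \log P(y_i \mid x_i; w) \;+\; \sum_{j} \frac{w_j^{2}}{2\sigma^{2}}
```

Under this form a smaller sigma penalizes large weights more strongly, which is why trying several values can pay off.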
8 Testing the Classifier
- We made a LinearClassifier
  - The linear classifier has a weight vector over the features for every possible label occurring in the training set
  - This is a multi-class logistic regression classifier
- LinearClassifier implements Classifier
  - Object classOf(Datum d)
  - Counter scoresOf(Datum d): un-normalized scores
- Other useful methods
  - justificationOf(Datum d)
  - logProbabilityOf(Datum d)
  - Various methods for visualizing the weights and the most highly weighted features
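How un-normalized scores relate to the predicted class and to log probabilities in a multi-class logistic regression model can be sketched as follows. This is a self-contained illustration: the method names echo the ones listed above but are stand-ins, not the LinearClassifier implementation, and the weights in `main` are made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class LinearScoringSketch {
    // weights.get(label).get(feature) is the weight of a binary feature for
    // that label; features without an entry contribute 0.
    static Map<String, Double> scoresOf(Map<String, Map<String, Double>> weights,
                                        Collection<String> features) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> e : weights.entrySet()) {
            double s = 0.0;
            for (String f : features) {
                s += e.getValue().getOrDefault(f, 0.0);
            }
            scores.put(e.getKey(), s);
        }
        return scores;
    }

    // The predicted class is the label with the highest un-normalized score.
    static String classOf(Map<String, Double> scores) {
        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    // log P(label | datum) = score(label) - log(sum over labels of exp(score)),
    // with the maximum subtracted first for numerical stability.
    static double logProbabilityOf(Map<String, Double> scores, String label) {
        double max = Collections.max(scores.values());
        double sum = 0.0;
        for (double s : scores.values()) {
            sum += Math.exp(s - max);
        }
        return scores.get(label) - (max + Math.log(sum));
    }

    public static void main(String[] args) {
        // Illustrative weights only.
        Map<String, Map<String, Double>> weights = new LinkedHashMap<>();
        weights.put("NNS", Collections.singletonMap("SUFFIX1=s", 2.0));
        weights.put("VB", Collections.singletonMap("SUFFIX1=s", -1.0));
        Map<String, Double> scores =
            scoresOf(weights, Arrays.asList("WORD=years", "SUFFIX1=s"));
        System.out.println(classOf(scores)); // prints NNS
        System.out.println(Math.exp(logProbabilityOf(scores, "NNS")));
    }
}
```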
9 Beyond Binary Features
- Suppose we want to weight the features for prefixes and suffixes of different lengths differently
- We want to give less weight to the longer prefixes/suffixes to avoid overfitting (shown useful for string kernels)
- Example: years NNS
  - WORD=years: value 1
  - PREFIX1=y, SUFFIX1=s: value 1
  - PREFIX2=ye, SUFFIX2=rs: value .5
  - PREFIX3=yea, SUFFIX3=ars: value .25
- Use RVFDatum
  - Counter c = new Counter();
  - c.incrementCount("PREFIX2=ye", .5); ...
  - RVFDatum d = new RVFDatum(c, "NNS");
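The weighting scheme in the example (1, .5, .25) can be sketched as a feature map from names to real values. The halving rule is an assumption that generalizes the three values shown. A plain `Map` stands in for the Counter, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WeightedAffixFeatures {
    // Word identity gets value 1; the prefix and suffix of length k get
    // value 1 / 2^(k-1), so longer affixes receive less weight.
    // Assumption: halving per extra character, inferred from 1, .5, .25.
    static Map<String, Double> extract(String word, int maxLen) {
        Map<String, Double> values = new LinkedHashMap<>();
        values.put("WORD=" + word, 1.0);
        double value = 1.0;
        int limit = Math.min(maxLen, word.length());
        for (int k = 1; k <= limit; k++) {
            values.put("PREFIX" + k + "=" + word.substring(0, k), value);
            values.put("SUFFIX" + k + "=" + word.substring(word.length() - k), value);
            value /= 2.0; // the next, longer affix gets half the weight
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("years", 3));
    }
}
```

Each entry of the returned map corresponds to one `incrementCount` call on the Counter that is passed to RVFDatum.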
10 Beyond Binary Features: Datasets and Classifiers
- RVFDataset dataSet = new RVFDataset();
- dataSet.add(rvfDatum);
- LinearClassifier trainClassifier(RVFDataset dataSet) {
  - LinearClassifierFactory lcFactory = new LinearClassifierFactory(new QNMinimizer(5), 1e-3, false, 1.0);
  - return (LinearClassifier) lcFactory.trainClassifier(dataSet);
- }
- A LinearClassifier is an RVFClassifier
11 Using LinearType2Classifier in Maxent
- The framework of LinearClassifiers builds multi-class logistic regression models: one parameter vector for each class
- Sometimes it is useful to have parameter sharing among classes, when some groups of classes have common properties
- For the tagging problem
  - Add features that are active for any of the noun classes: NN and NNS
  - Add features that are active for any of the verb classes: VB, VBP, VBZ, VBD, VBN
  - Example: PREFIX4=keep and LABEL=(VB or VBZ or VBN or VBD or VBP)
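Parameter sharing of this kind amounts to pairing each base feature with both the candidate label and the label's group. The sketch below illustrates that in plain Java. The `&` separator, the class name, and the rule that every tag outside NN/NNS is a verb tag (true only for this restricted tag set) are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SharedLabelFeatures {
    static final Set<String> NOUN_TAGS = new HashSet<>(Arrays.asList("NN", "NNS"));

    // In the restricted tag set every tag that is not a noun tag is a verb tag.
    static String group(String label) {
        return NOUN_TAGS.contains(label) ? "noun" : "verb";
    }

    // For one candidate label, each base feature yields two joint features:
    // one specific to the label and one shared by all labels in its group.
    static List<String> jointFeatures(Collection<String> baseFeatures, String label) {
        List<String> joint = new ArrayList<>();
        for (String f : baseFeatures) {
            joint.add(label + "&" + f);
            joint.add(group(label) + "&" + f);
        }
        return joint;
    }

    public static void main(String[] args) {
        // prints [VBZ&PREFIX4=keep, verb&PREFIX4=keep]
        System.out.println(jointFeatures(Arrays.asList("PREFIX4=keep"), "VBZ"));
    }
}
```

Because the `verb&PREFIX4=keep` feature is produced for every verb label, a single weight on it is shared across VB, VBP, VBZ, VBD, and VBN.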
12 Building LinearType2Classifiers
- Make Type2Datum from the instances
- GeneralizedCounter features = new GeneralizedCounter(2);
- Datum d = makeDatum(word, label);
- Collection feat1D = d.asFeatures();
- for (Iterator featIter = feat1D.iterator(); featIter.hasNext();) {
  - Object feature1D = featIter.next();
  - for (int i = 0; i < labelIndex.size(); i++) {
    - features.incrementCount2D(integers.get(i), new Pair(labelIndex.get(i), feature1D));
    - if (isNoun(i))
      - features.incrementCount2D(integers.get(i), new Pair("noun", feature1D));
    - else
      - features.incrementCount2D(integers.get(i), new Pair("verb", feature1D));
  - }
- }
- Type2Datum dat = new Type2Datum(features, integers.get(labelIndex.indexOf(label)));
13 Building LinearType2Classifiers
- LinearType2Classifier trainClassifier(Type2Dataset dataSet) {
  - return LinearType2Classifier.trainClassifier(dataSet, dataSet.featureIndex());
- }
- A static method for training: no factory
- There is presently no good interface for specifying the tolerance, sigma
  - Uses QNMinimizer with default sigma = 1
14 LinearType2Classifiers: Real Valued Features
- The features are real valued by default
- protected Type2Datum makeWeightedType2Datum(Object word, Object label) {
  - GeneralizedCounter features = new GeneralizedCounter(2);
  - RVFDatum d = makeRVFDatum(word, label);
  - Counter feat1D = d.asFeatures();
  - for (Iterator featIter = feat1D.keySet().iterator(); featIter.hasNext();) {
    - Object feature1D = featIter.next();
    - for (int i = 0; i < labelIndex.size(); i++) {
      - features.incrementCount2D(integers.get(i), new Pair(labelIndex.get(i), feature1D), feat1D.getCount(feature1D));
      - if (isNoun(i))
        - features.incrementCount2D(integers.get(i), new Pair("noun", feature1D), feat1D.getCount(feature1D));
      - else
        - features.incrementCount2D(integers.get(i), new Pair("verb", feature1D), feat1D.getCount(feature1D));
    - }
  - }
  - ...
- }
15 More Classes for Efficiency
- There is a more memory-efficient representation, which works in special cases such as the tagging example, where many features repeat across different labels
  - Type2Corpus
  - addInstance(SectionedType2Datum datum)
- More efficient log-linear classifier for 2 classes