Title: Text segmentation in Informedia
1 Text segmentation in Informedia
Faculty Mentor: Alex Hauptmann
TA Mentor: Vandi Verma
Students: Zhirong Wang, Ningning Hu, Jichuan Chang
2 Data and Methods
- Data
- CNN WorldView (01/1999-10/2000)
- Preprocessing: stemming, merging, stop-word removal
- Methods
- Classification
- Artificial Neural Network (sentence)
- Naive Bayes (sentence/fixed length window)
- SVM (sentence)
- Topic change detection
- EM clustering
- Parameters: number of topics, block size
001630 CENTURY >>> WE PEOPLE TEND TO 001631 PUT
THINGS LIKE THE PASSING OF A 001633 MILLENIUM IN
SHARP FOCUS. WE 001633 CELEBRATE, CONTEMPLATE,
EVEN 001635 WORRY A BIT, SOMETIMES WORRY A
001636 LOT. AFTER ALL, IT'S SOMETHING 001638
THAT HAPPENS ONLY ONCE EVERY ONE 001641 THOUSAND
YEARS. A BIG DEAL? 001641 PERHAPS NOT TO ALL
LIVING THINGS, 001642 AS CNN'S RICHARD BLYSTONE
001643 FOUND OUT WHEN HE CONSIDERED ONE 001654
VERY OLD TREE. >>> HO HUM. 001654 ANOTHER
MILLENNIUM. THE GREAT YEW
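A minimal sketch of how caption text like the sample above might be cleaned and cut into fixed-length blocks. It assumes the six-digit tokens are HHMMSS timestamps and that the speaker-change marker appears as `>>>` or `gtgtgt`. The stop-word list, the names `clean_caption` and `fixed_blocks`, and the block size of 4 are illustrative assumptions, and stemming and merging are left out.

```python
import re

# Tiny illustrative stop-word list (assumption; the slides do not give the real one).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "we", "it", "is", "and"}

TIMESTAMP = re.compile(r"^\d{6}$")   # e.g. 001630, assumed to mean 00:16:30
SPEAKER_MARK = {">>>", "gtgtgt"}     # speaker-change marker, raw or garbled


def clean_caption(text):
    """Lowercase caption text; drop timestamps, markers, punctuation, stop words."""
    words = []
    for token in text.split():
        if TIMESTAMP.match(token) or token in SPEAKER_MARK:
            continue
        word = re.sub(r"[^a-z']", "", token.lower())
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


def fixed_blocks(words, size):
    """Split a word list into consecutive blocks of `size` words (last may be shorter)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


sample = ("001630 CENTURY gtgtgt WE PEOPLE TEND TO 001631 PUT THINGS LIKE "
          "THE PASSING OF A 001633 MILLENIUM IN SHARP FOCUS.")
words = clean_caption(sample)
blocks = fixed_blocks(words, 4)
```

The caption misspelling MILLENIUM is left untouched on purpose, since noisy input is part of what the classifiers have to cope with.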
3 Experimental Results
[Figure: a sequence of sentences comparing identified boundaries with reference boundaries; each boundary is marked OK, Miss, or False Alarm]
- Recall = OK / (OK + Miss)
- Precision = OK / (OK + False Alarm)
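The recall and precision definitions can be sketched in a few lines, using the OK, Miss, and False Alarm counts from the figure. This sketch assumes a boundary counts as OK only when it falls at exactly the same sentence position as a reference boundary; the slides do not say whether any tolerance was allowed. The function name `score_boundaries` and the example positions are hypothetical.

```python
def score_boundaries(identified, reference):
    """Return (recall, precision) for two collections of boundary positions."""
    identified, reference = set(identified), set(reference)
    ok = len(identified & reference)            # found and correct
    miss = len(reference - identified)          # reference boundary not found
    false_alarm = len(identified - reference)   # found but not in the reference
    recall = ok / (ok + miss) if reference else 0.0
    precision = ok / (ok + false_alarm) if identified else 0.0
    return recall, precision


# Hypothetical example: reference boundaries after sentences 3, 7 and 12.
recall, precision = score_boundaries(identified=[3, 7, 9, 15], reference=[3, 7, 12])
```

Here two boundaries are OK, one is a Miss, and two are False Alarms, giving a recall of 2/3 and a precision of 1/2.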
- Feature selection
- Block size
- Best Classifier
- Naive Bayes Classifier
- Fixed length block
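As an illustration of the best-performing setup, here is a minimal multinomial Naive Bayes sketch over fixed-length word blocks, with add-one smoothing. It assumes each block carries a binary label saying whether it contains a topic boundary; the slides do not spell out the labeling scheme, and the class name, labels, and toy training blocks are made up for this example and are not taken from the CNN data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, blocks, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(blocks, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)          # log prior
            for w in words:
                if w in self.vocab:              # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training blocks (not from the CNN data).
train_blocks = [
    ["thanks", "watching", "next", "story"],
    ["coming", "up", "next", "report"],
    ["tree", "yew", "old", "years"],
    ["millennium", "celebrate", "years", "worry"],
]
train_labels = ["boundary", "boundary", "no_boundary", "no_boundary"]

model = NaiveBayes().fit(train_blocks, train_labels)
prediction = model.predict(["next", "story", "coming"])
```

Working in log space avoids numeric underflow when blocks are long, and the smoothing keeps an unseen word-class pair from zeroing out the whole score.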
4 Discussion
- Impact of data set
- Good recall, lower precision
- Noisy closed-captioning text
- Ratio of positive to negative examples
- Combining different classifiers
- Different granularity
- Voting
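One possible way to combine classifiers by voting is sketched below. It assumes every classifier's output has already been mapped to a common unit (sentence indices), which is the step needed when the classifiers work at different granularities. The function name `vote`, the `min_votes` parameter, and the example outputs are hypothetical.

```python
from collections import Counter


def vote(predictions, min_votes=None):
    """Keep a boundary position if at least `min_votes` classifiers propose it."""
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1   # simple majority
    # set() ensures each classifier votes at most once per position
    tally = Counter(pos for boundaries in predictions for pos in set(boundaries))
    return sorted(pos for pos, n in tally.items() if n >= min_votes)


# Hypothetical classifier outputs, all expressed as sentence indices.
ann = [3, 7, 10]
bayes = [3, 8, 12]
svm = [3, 7, 12, 15]
combined = vote([ann, bayes, svm])
```

Raising `min_votes` trades recall for precision, which is relevant given the good-recall, lower-precision pattern noted above.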