Text Representation - PowerPoint PPT Presentation

About This Presentation
Title:

Text Representation

Description:

Ning Yu, School of Library and Information Science, Indiana University at Bloomington. Text Representation & Text Classification for Intelligent Information Retrieval

Slides: 22
Provided by: YuNi9

Transcript and Presenter's Notes


1
Text Representation & Text Classification for
Intelligent Information Retrieval
  • Ning Yu
  • School of Library and Information Science
  • Indiana University at Bloomington

2
Outline
  • The big picture
  • A specific problem: opinion detection

3
Intelligent information retrieval
  • Characteristics
  • Not restricted to keyword matching and Boolean
    search
  • Deal with natural language queries and advanced
    search criteria
  • Coarse-to-fine level of granularity
  • Automatically organize/evaluate/interpret
    solution space
  • User-centered, e.g., adapt to users' learning
    habits
  • Etc.

4
Intelligent information retrieval
  • System Preferences
  • Various sources of evidence
  • Natural language processing
  • Semantic web technologies
  • Automatic text classification
  • Etc.

5
Intelligent IR system diagram

6
A Specific Question: Semi-Supervised Learning
for Identifying Opinions in Web Content
  • Dissertation work

7
Growing demand for online opinions
  • Enormous body of user-generated content
  • About anything, published anywhere and at any
    time
  • Useful for literature review, decision making,
    market monitoring, etc.

8
Major approaches for opinion detection

9
What's Essential? Labeled Data! And lots of
them!!!
  • To acquire a broad and comprehensive collection
    of opinion-bearing features (e.g., bag-of-words,
    POS words, N-grams (n > 1), linguistic
    collocations, stylistic features, contextual
    features)
  • To generate complex patterns (e.g., good
    amount) that can approximate the context of
    words.
  • To generate and evaluate opinion detection
    systems
  • To allow evaluation of opinion detection
    strategies with high confidence

10
Challenges for opinion detection
  • Shortage of opinion-labeled data: manual
    annotation is tedious, error-prone and difficult
    to scale up
  • Domain transfer: strategies designed for
    opinion detection in one data domain generally
    do not perform well in another domain

11
Motivations & research question
  • Easy to collect unlabeled user-generated content
    that contains opinions
  • Semi-Supervised Learning (SSL) requires only a
    limited amount of labeled data to automatically
    label unlabeled data, and it has achieved
    promising results in NLP studies
  • Is SSL effective in opinion detection both in
    sparse data situations and for domain adaptation?

12
Datasets & data split
Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174

13
Two major SSL methods: Self-training
  • Assumption: Highly confident predictions made by
    an initial opinion classifier are reliable and
    can be added to the labeled set.
  • Limitation: Auto-labeled data may be biased by
    the particular opinion classifier.
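
The self-training loop described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the dissertation: the tiny Naive Bayes helper (train_nb), the confidence threshold of 0.9, the round limit, and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def features(sentence):
    """Binary unigram features: the set of lowercased tokens."""
    return set(sentence.lower().split())

def train_nb(examples):
    """Fit a small Bernoulli Naive Bayes on (sentence, label) pairs.

    Returns predict(sentence) -> (best_label, probability).
    Hypothetical helper written for this sketch only.
    """
    counts = Counter(label for _, label in examples)
    doc_freq = {c: Counter() for c in counts}
    for sentence, label in examples:
        doc_freq[label].update(features(sentence))
    vocab = set().union(*(doc_freq[c].keys() for c in doc_freq))
    total = sum(counts.values())

    def predict(sentence):
        present = features(sentence)
        log_p = {}
        for c in counts:
            lp = math.log(counts[c] / total)
            for w in vocab:
                # Add-one smoothing keeps p strictly between 0 and 1.
                p = (doc_freq[c][w] + 1) / (counts[c] + 2)
                lp += math.log(p if w in present else 1 - p)
            log_p[c] = lp
        top = max(log_p.values())
        norm = sum(math.exp(v - top) for v in log_p.values())
        best = max(log_p, key=log_p.get)
        return best, math.exp(log_p[best] - top) / norm

    return predict

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Repeatedly move confidently predicted sentences into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        predict = train_nb(labeled)
        confident, rest = [], []
        for s in pool:
            label, p = predict(s)
            if p >= threshold:
                confident.append((s, label))
            else:
                rest.append(s)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = rest
    return train_nb(labeled), labeled

# Toy data, invented for illustration only.
seed = [
    ("i love this great movie", "opinion"),
    ("this film is awful", "opinion"),
    ("the movie opens on friday", "non-opinion"),
    ("the film runs two hours", "non-opinion"),
]
pool = ["i love this film", "the movie runs on friday", "great great movie"]
model, final_labeled = self_train(seed, pool, threshold=0.6)
```

The limitation noted on the slide is visible in the loop: every auto-labeled sentence comes from the same classifier, so its early mistakes are fed back into its own training data.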

14
Two major SSL methods: Co-training
  • Assumption: Two opinion classifiers with
    different strengths and weaknesses can benefit
    from each other.
  • Limitation: It is not always easy to create two
    different classifiers.
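
One way to picture co-training is the sketch below, in which two classifiers see different feature views (unigrams vs. bigrams) and each one's confident predictions are added to the other's training set. This is only one common variant, written under assumptions: the helper names, the 0.9 threshold, the round limit, and the toy sentences are illustrative and are not taken from the original study.

```python
import math
from collections import Counter

def unigrams(sentence):
    return set(sentence.lower().split())

def bigrams(sentence):
    tokens = sentence.lower().split()
    return set(zip(tokens, tokens[1:]))

def train_nb(examples, view):
    """Bernoulli Naive Bayes over one feature view (hypothetical helper).

    Returns predict(sentence) -> (best_label, probability).
    """
    counts = Counter(label for _, label in examples)
    doc_freq = {c: Counter() for c in counts}
    for sentence, label in examples:
        doc_freq[label].update(view(sentence))
    vocab = set().union(*(doc_freq[c].keys() for c in doc_freq))
    total = sum(counts.values())

    def predict(sentence):
        present = view(sentence)
        log_p = {}
        for c in counts:
            lp = math.log(counts[c] / total)
            for f in vocab:
                p = (doc_freq[c][f] + 1) / (counts[c] + 2)  # add-one smoothing
                lp += math.log(p if f in present else 1 - p)
            log_p[c] = lp
        top = max(log_p.values())
        norm = sum(math.exp(v - top) for v in log_p.values())
        best = max(log_p, key=log_p.get)
        return best, math.exp(log_p[best] - top) / norm

    return predict

def co_train(labeled, unlabeled, views=(unigrams, bigrams),
             threshold=0.9, max_rounds=5):
    """Each classifier labels confident sentences for the OTHER classifier."""
    train_sets = [list(labeled), list(labeled)]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        models = [train_nb(train_sets[i], views[i]) for i in range(2)]
        remaining, moved = [], False
        for s in pool:
            taught = False
            for i, model in enumerate(models):
                label, p = model(s)
                if p >= threshold:
                    train_sets[1 - i].append((s, label))
                    taught = True
            if taught:
                moved = True
            else:
                remaining.append(s)
        pool = remaining
        if not moved:
            break
    models = [train_nb(train_sets[i], views[i]) for i in range(2)]
    return models, train_sets

# Toy data, invented for illustration only.
seed = [
    ("i love this great movie", "opinion"),
    ("this film is awful", "opinion"),
    ("the movie opens on friday", "non-opinion"),
    ("the film runs two hours", "non-opinion"),
]
pool = ["i love this film", "the movie runs on friday", "this film is great"]
models, train_sets = co_train(seed, pool, threshold=0.6)
```

Swapping the views argument for two random halves of the feature set, or for two different model families, gives the other strategies a deck like this one compares.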

15
Experimental design
  • General settings for SSL
      • Naïve Bayes classifier for self-training
      • Binary values for unigram and bigram features
  • Co-training strategies
      • Unigrams and bigrams (content vs. context)
      • Two randomly split feature/training sets
      • A character-based language model (CLM) and a
        bag-of-words model (BOW)
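
To make "binary values for unigram and bigram features" concrete, here is a small sketch that turns sentences into 0/1 vectors. Whitespace tokenization and the function names are assumptions for illustration; the original experiments may have preprocessed text differently.

```python
def binary_features(sentence):
    """Set of unigram and bigram features present in the sentence."""
    tokens = sentence.lower().split()
    feats = {("uni", t) for t in tokens}
    feats |= {("bi", a, b) for a, b in zip(tokens, tokens[1:])}
    return feats

def vectorize(sentences):
    """Map sentences to 0/1 rows over a shared, sorted feature vocabulary."""
    vocab = sorted(set().union(*(binary_features(s) for s in sentences)))
    index = {f: i for i, f in enumerate(vocab)}
    rows = []
    for s in sentences:
        row = [0] * len(vocab)
        for f in binary_features(s):
            row[index[f]] = 1  # presence only: repeated words still give 1
        rows.append(row)
    return vocab, rows

vocab, rows = vectorize(["good movie", "bad movie"])
```

Because values are binary, a sentence such as "good good movie" gets a 1 for the unigram "good" rather than a count of 2.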

16
Results: Overall
  • For movie reviews and news articles, co-training
    proved to be the most robust
  • For blog posts, SSL showed no benefit over
    supervised learning (SL) due to the low initial
    accuracy

17
Results: Movie reviews
  • Both self-training and co-training can improve
    opinion detection performance
  • Co-training is more effective than self-training

18
Results: Movie reviews (cont.)
  • The more different the two classifiers, the
    better the performance

19
Results: Domain transfer (movie reviews -> blog posts)
  • For a difficult domain (e.g., blog), simple
    self-training alone is promising for tackling the
    domain transfer problem.

20
Contributions
  • Comprehensive research expands the spectrum of
    SSL application to opinion detection
  • Investigation of SSL model that best fits the
    problem space extends understanding of opinion
    detection and provides a resource for
    knowledge-based representation
  • Generation of guidelines and evaluation baselines
    advances later studies using SSL algorithms in
    opinion detection
  • Research extensible to other data domains,
    non-English texts, and other text mining tasks

21
Thank you!
If you want a second opinion, I'll ask my
computer
www.CartoonStock.com