What's Happened Since the First SIGDAT Meeting

1
What's Happened Since the First SIGDAT Meeting?
  • Kenneth Ward Church
  • AT&T Labs-Research
  • kwc@research.att.com

2
The First SIGDAT Meeting
  • WVLC-1 was held just before ACL-93
  • Great turnout!
  • More like a conference than a workshop
  • We knew that corpora were hot, but didn't
    appreciate just how hot they would turn out to be.

3
Sister meetings have also done very well since 1993
  • Information Retrieval
  • http://www.acm.org/sigir/
  • Digital Libraries
  • http://fox.cs.vt.edu/DL99/
  • Machine Learning
  • http://www.cs.cmu.edu/Web/Groups/NIPS
  • Data-mining, Databases, Data Warehousing
  • http://www.acm.org/sigkdd/
  • http://www.vldb.org/

4
Empiricism has a long history
  • In the 1950s, empiricism dominated a broad set
    of fields, from psychology (behaviorism) to
    electrical engineering (information theory).
  • At the time, it was common practice in
    linguistics to classify words not only on the
    basis of their meanings but also on the basis of
    their co-occurrence with other words.
  • "You shall know a word by the company it keeps"
    (Firth, 1957)
  • Regrettably, interest in empiricism faded in the
    1960s, with
  • Chomsky's criticism of ngrams in Syntactic
    Structures (1957) and
  • Minsky and Papert's criticism of neural nets in
    Perceptrons (1969).
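
The idea of classifying words by their co-occurrence with other words can be illustrated with a small sketch. This is not from the talk: the function names, the window size, and the toy corpus are all assumptions chosen for illustration, and cosine similarity is just one common way to compare co-occurrence counts.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[w] for w, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: "tea" and "coffee" keep the same company.
toy_corpus = [
    "strong tea with milk",
    "strong coffee with milk",
    "powerful computer with memory",
]
vectors = cooccurrence_vectors(toy_corpus)
tea_coffee = cosine(vectors["tea"], vectors["coffee"])
tea_computer = cosine(vectors["tea"], vectors["computer"])
```

In this toy corpus, tea_coffee comes out higher than tea_computer because "tea" and "coffee" share all of their neighbors, while "tea" and "computer" share only "with".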

5
1990s Revival
  • Empiricism regained a dominant position
  • Ngrams and Hidden Markov Models (HMMs) became the
    method of choice in Speech.
  • Neural Networks (Perceptrons + Hidden Layers)
    helped create Machine Learning.
  • Empiricism → Rationalism → Empiricism
  • Oscillates about once a career
  • Mark Twain: Grandparents and grandchildren have a
    natural alliance.

6
Why the Revival? It was a bad idea then, and it is still a bad idea now
  • More powerful computers??
  • Availability of massive quantities of data!!
  • Text is available like never before.
  • Not long ago, the Brown Corpus was considered
    large.
  • But now, text is available like never before!
  • First came collection efforts (www.ldc.upenn.org),
  • And now everyone has access to the Web!
  • Experiments are routinely carried out on
    gigabytes of text.
  • Some researchers are even working with terabytes.

7
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

8
The Web, Stupid!
  • If you publish a paper about neat stuff, it is
    expected that you will post it on the web.
  • I'll mention just a few examples of neat stuff on
    the web.
  • Demos
  • Data
  • Tools

9
Lots of Neat Demos on the Web
  • Web Searching with Machine Translation
  • www.altavista.com (uses Systran)
  • Cross-Language Information Retrieval (CLIR)
  • www.xrce.xerox.com
  • Parallel Corpora: www-rali.iro.umontreal.ca
  • Latent Semantic Indexing (LSI)
  • superbook.bellcore.com/remde/lsi
  • lsa.colorado.edu
  • Speech Synthesis: www.bell-labs.com/project/tts
  • Dotplot: www.cs.unm.edu/jon/dotplot

10
Lots of Neat Data on the Web
  • Wordnet: www.cogsci.princeton.edu/wn
  • Linguistic Data Consortium (LDC)
  • www.ldc.upenn.org
  • SIGLEX: www.clres.com/siglex.html
  • Discourse Resource Initiative (DRI)
  • www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html
  • The Federalist Papers
  • www.mcs.net/knautzr/fed

11
More Neat Data on the Web (in Lots of Languages)
  • Chinese
  • rocling.iis.sinica.edu.tw
  • www.sinica.edu.tw
  • Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html
  • Electronic Dictionary Research (EDR):
    www.iijnet.or.jp/edr
  • Advanced Telecommunications Research (ATR):
    www.atr.co.jp
  • www.rdt.monash.edu.au/jwb/japanese.html
  • Korean: korterm.kaist.ac.kr
  • European Language Resources Association (ELRA)
  • www.icp.grenet.fr/ELRA
  • Parallel Text (Resnik, ACL-99)
  • Canadian Hansards: WWW.Parl.GC.CA
  • Turkish: www.nlp.cs.bilkent.edu.tr
  • Swedish: svenska.gu.se

12
Lots of Neat Tools on the Web
  • Penntools (links to all over the world)
  • www.cis.upenn.edu/adwait/penntools.html
  • Part of Speech Taggers (see above)
  • Juman/Chasen
  • pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html
  • cl.aist-nara.ac.jp/lab/nlt/chasen.html
  • Suffix Arrays
  • http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
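
For readers unfamiliar with suffix arrays, here is a minimal sketch of the idea: sort the start positions of all suffixes of a text, then use binary search to count how often any substring occurs. This is an illustration only. It is a naive construction, not the algorithm in the ssort.c file linked above, and the function names and example text are assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction: sorting full suffix slices is simple but slow,
    so this is only suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of `pattern` in `text` by binary search."""
    m = len(pattern)

    def first_index(strictly_greater):
        # First position whose length-m suffix prefix is >= pattern,
        # or > pattern when strictly_greater is True.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


# Hypothetical example text.
text = "to be or not to be"
suffix_array = build_suffix_array(text)
to_be_count = count_occurrences(text, suffix_array, "to be")
missing_count = count_occurrences(text, suffix_array, "xyz")
```

Because all suffixes that begin with the same substring sit next to each other in sorted order, the count is the width of one contiguous block, which is why two binary searches are enough.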

13
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

14
Shared Resources & Evaluation
  • Common tasks
  • TREC (trec.nist.gov), Tipster, MUC
  • Common benchmark corpora: Brown, Penn Treebank,
    Wall Street Journal, Switchboard
  • Shared lexical resources: Wordnet
    (www.cogsci.princeton.edu/wn/)
  • Common labeling conventions/standards in all
    areas of NLP, from Speech to Discourse
  • Evaluation, evaluation, evaluation
  • Required to get a paper accepted anywhere.

15
In 1993, it wasn't like this...
  • Invited talks at ACL-93:
  • Planning Multimodal Discourse
  • Transfers of Meaning
  • Quantificational Domains and Recursive Contexts
  • Less sharing of resources
  • Evaluation not required

16
Empiricism vs. Rationalism
  • Pluses: Clear measurable progress
  • Speech Recognition
  • Part of Speech Tagging
  • Parsing
  • Minuses: Herd mentality, incrementalism, mindless
    metrics, duplicated effort
  • Recall: empiricism fell out of favor in the 1960s
    when methodology became too burdensome.

17
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

18
Main Street: Big change since 1993
  • Large corpora are now having an impact on
    ordinary users
  • Web search engines/portals
  • Managing gigabytes: not just a popular book, but
    something that ordinary users are beginning to
    take for granted.

19
Huge Commercial Successes (Since 1993)
  • Information Retrieval & Digital Libraries
  • Web search engines/portals: highly successful on
    both Wall Street and Main Street
  • Invited talks from Lycos (1997) and Infoseek (1998)
  • Machine Translation & Speech
  • Available wherever software is sold
  • Can't use a phone without talking to a computer

20
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

21
How Large is Very Large?

22
Mirror, mirror on the wall
  • Who is the largest of them all?
  • The Web?
  • Lexis-Nexis?
  • West?
  • We have had invited talks from all three
  • Web: Lycos (1997), Infoseek (1998)
  • Lexis-Nexis (1993)
  • West (1997)

23
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

24
Internationalization
  • SIGDAT-93: Nearly equal participation
  • America: 4 papers
  • Asia: 4 papers
  • Europe: 3 papers
  • Great growth in activity around the world,
    especially Asia
  • SIGDAT has met in a dozen cities (50% in America)
  • America: Columbus, Cambridge, Philadelphia,
    Providence, Montreal, College Park
  • Asia: Kyoto, Beijing, Hong Kong
  • Europe: Dublin, Copenhagen, Granada

25
Some Topics that are Behind the International Expansion
  • Classic Issues
  • Machine Translation (MT) / Tools
  • Input Method Editor (IME): MS-IME98
  • Morphology: Juman, Chasen
  • New Issues
  • Cross-language Information Retrieval (CLIR)
  • Browsing the Internet: integrate IME + CLIR + MT
  • Parallel and comparable corpora
  • Terminology Extraction & Alignment
  • Suffix Arrays

26
Big Changes Since 1993
  • The Web, stupid!
  • Demos
  • Data
  • Research
  • Shared resources & evaluation
  • Scale: How large is very large?
  • Increased breadth: Geography, Topics
  • Commercial: Wall Street & Main Street

27
Broader (and More Applied) View of Computational Linguistics
  • Data-mining, Databases, Data Warehousing
  • Digital Libraries
  • Information Retrieval, Categorization, Extraction
  • Lexicography
  • Machine Learning
  • Machine Translation
  • Speech
  • Text Analysis

28
Data-Mining Issues (How Large is Very Large?)
  • Similar technology to corpus-based methods
  • But much larger datasets
  • Newswire (AP): 1 million words per week
  • Telephone calls: 1-10 billion per month
  • IP packets: expected to be even larger
  • Tasks: Fraud, Marketing, Operations, Care
  • Identify knobs that business partners can turn
  • Increase demand (buy TV ads, reduce price)
  • Increase supply (buy network capacity, enhance
    operations)
  • Target opportunities for improvement (marketing
    prospects)
  • Track market response in real time (supply/demand
    by knob)

29
Best of SIGDAT
  • Best Invited Talk
  • Work of Note
  • Work of Note (in Related Fields)

30
Best Invited Talk at a SIGDAT Meeting
  • Henry Kucera and Nelson Francis
  • Third Workshop on Very Large Corpora (1995)
  • Massachusetts Institute of Technology (MIT)
  • Cambridge, MA, USA
  • Described their work on the Brown Corpus
  • At a time when empiricism was out of fashion
  • especially at MIT
  • Personal & Touching (received standing ovation)

31
Work of Note
  • Statistical Machine Translation / Alignment
  • Brown et al.
  • Statistical Parsing (In 1993, poor use of lexical
    info)
  • Jelinek, Magerman, Charniak, Collins
  • Statistical PP Attachment
  • Hindle and Rooth
  • Word-sense Disambiguation
  • Yarowsky
  • Text-tiling (Discourse Parsing)
  • Hearst

32
Work of Note (in Related Fields)
  • Learning
  • Classification and Regression Trees (CART)
  • Ripper
  • Web Tools
  • Managing Gigabytes, Harvest, SGML → XML
  • Representation
  • Suffix Arrays
  • Latent Semantic Indexing

33
Summary: Reaching a Wider Audience
  • Commercial Successes
  • Main Street & Wall Street
  • Internationalization
  • Goal: equal representation from America, Asia &
    Europe
  • More topic areas
  • Information Retrieval, Speech, Machine
    Translation, Machine Learning, Data-mining

34
Self-organizing vs. EDA
  • Self-organizing: Learning, HMM
  • Statistics do it all
  • Manual
  • Wilks (Stone Soup): Statistics don't do nothing
  • Exploratory Data Analysis (EDA)
  • Hybrid of above

35
Time for a little controversy: Two types of Empiricism
  • New Linguistic Insights vs. Methodology
  • Reviewers do what reviewers do
  • Safe, conservative, seek precedents, case law
  • Reviewers go easy on methodology papers
  • Grim historical reminder
  • Recall: empiricism fell out of favor in the 1960s
    when methodology became too burdensome.
  • Shouldn't let the methodology get in the way of
    what we are here to do.