Title: What's Happened Since the First SIGDAT Meeting
1 What's Happened Since the First SIGDAT Meeting?
- Kenneth Ward Church
- AT&T Labs-Research
- kwc_at_research.att.com
2 The First SIGDAT Meeting
- WVLC-1 was held just before ACL-93
- Great turnout!
- More like a conference than a workshop
- We knew that corpora were hot, but didn't appreciate just how hot they would turn out to be.
3 Sister meetings have also done very well since 1993
- Information Retrieval
- http://www.acm.org/sigir/
- Digital Libraries
- http://fox.cs.vt.edu/DL99/
- Machine Learning
- http://www.cs.cmu.edu/Web/Groups/NIPS
- Data-mining, Databases, Data Warehousing
- http://www.acm.org/sigkdd/
- http://www.vldb.org/
4 Empiricism has a long history
- In the 1950s, empiricism dominated a broad set of fields
- from psychology (behaviorism)
- to electrical engineering (information theory).
- At the time, it was common practice in linguistics to classify words not only on the basis of their meanings
- but also on the basis of their co-occurrence with other words.
- "You shall know a word by the company it keeps" (Firth, 1957)
- Regrettably, interest in empiricism faded in the 1960s
- Chomsky's criticism of ngrams in Syntactic Structures (1957) and
- Minsky and Papert's criticism of neural nets in Perceptrons (1969).
5 1990s Revival
- Empiricism regained a dominant position
- Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech.
- Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning.
- Empiricism → Rationalism → Empiricism
- Oscillates about once a career
- Mark Twain: Grandparents and grandchildren have a natural alliance.
6 Why the Revival? It was a bad idea then, and it is still a bad idea now
- More powerful computers??
- Availability of massive quantities of data!!
- Text is available like never before.
- Not long ago, the Brown Corpus was considered large.
- But now, text is available like never before!
- First came collection efforts (www.ldc.upenn.org),
- And now everyone has access to the Web!
- Experiments are routinely carried out on gigabytes of text.
- Some researchers are even working with terabytes.
7 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
8 The Web, Stupid!
- If you publish a paper about neat stuff, it is expected that you will post it on the web.
- I'll mention just a few examples of neat stuff on the web.
- Demos
- Data
- Tools
9 Lots of Neat Demos on the Web
- Web Searching with Machine Translation
- www.altavista.com (uses Systran)
- Cross-Language Information Retrieval (CLIR)
- www.xrce.xerox.com
- Parallel Corpora: www-rali.iro.umontreal.ca
- Latent Semantic Indexing (LSI)
- superbook.bellcore.com/remde/lsi
- lsa.colorado.edu
- Speech Synthesis: www.bell-labs.com/project/tts
- Dotplot: www.cs.unm.edu/jon/dotplot
10 Lots of Neat Data on the Web
- Wordnet: www.cogsci.princeton.edu/wn
- Linguistic Data Consortium (LDC)
- www.ldc.upenn.org
- SIGLEX: www.clres.com/siglex.html
- Discourse Resource Initiative (DRI)
- www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html
- The Federalist Papers
- www.mcs.net/knautzr/fed
11 More Neat Data on the Web (in Lots of Languages)
- Chinese
- rocling.iis.sinica.edu.tw
- www.sinica.edu.tw
- Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html
- Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr
- Advanced Telecommunications Research (ATR): www.atr.co.jp
- www.rdt.monash.edu.au/jwb/japanese.html
- Korean: korterm.kaist.ac.kr
- European Language Resources Association (ELRA)
- www.icp.grenet.fr/ELRA
- Parallel Text (Resnik, ACL-99)
- Canadian Hansards: WWW.Parl.GC.CA
- Turkish: www.nlp.cs.bilkent.edu.tr
- Swedish: svenska.gu.se
12 Lots of Neat Tools on the Web
- Penntools (links to all over the world)
- www.cis.upenn.edu/adwait/penntools.html
- Part of Speech Taggers (see above)
- Juman/Chasen
- pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html
- cl.aist-nara.ac.jp/lab/nlt/chasen.html
- Suffix Arrays
- http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
13 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
14 Shared Resources & Evaluation
- Common tasks
- TREC (trec.nist.gov), Tipster, MUC
- Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard
- Shared lexical resources: Wordnet (www.cogsci.princeton.edu/wn/)
- Common labeling conventions/standards in all areas of NLP, from Speech to Discourse
- Evaluation, evaluation, evaluation
- Required to get a paper accepted anywhere.
15 In 1993, it wasn't like this...
- Invited talks at ACL-93
- Planning Multimodal Discourse
- Transfers of Meaning
- Quantificational Domains and Recursive Contexts
- Less sharing of resources
- Evaluation not required
16 Empiricism vs. Rationalism
- Pluses: Clear measurable progress
- Speech Recognition
- Part of Speech Tagging
- Parsing
- Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort
- Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.
17 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
18 Main Street: Big change since 1993
- Large corpora are now having an impact on ordinary users
- Web search engines/portals
- Managing gigabytes, not just a popular book, but something that ordinary users are beginning to take for granted.
19 Huge Commercial Successes (Since 1993)
- Information Retrieval & Digital Libraries
- Web search engines/portals: highly successful on both Wall Street and Main Street
- Invited talks from Lycos (1997) & Infoseek (1998)
- Machine Translation & Speech
- Available wherever software is sold
- Can't use a phone without talking to a computer
20 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
21 How Large is Very Large?
22 Mirror, mirror on the wall
- Who is the largest of them all?
- The Web?
- Lexis-Nexis?
- West?
- We have had invited talks from all three
- Web: Lycos (1997) & Infoseek (1998)
- Lexis-Nexis (1993)
- West (1997)
23 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
24 Internationalization
- SIGDAT-93: Nearly equal participation
- America: 4 papers
- Asia: 4 papers
- Europe: 3 papers
- Great growth in activity around the world, especially Asia
- SIGDAT has met in a dozen cities (50% in America)
- America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park
- Asia: Kyoto, Beijing, Hong Kong
- Europe: Dublin, Copenhagen, Granada
25 Some Topics that are Behind the International Expansion
- Classic Issues
- Machine Translation (MT) / Tools
- Input Method Editor (IME): MS-IME98
- Morphology: Juman, Chasen
- New Issues
- Cross-language Information Retrieval (CLIR)
- Browsing the Internet: integrate IME, CLIR and MT
- Parallel and comparable corpora
- Terminology Extraction & Alignment
- Suffix Arrays
26 Big Changes Since 1993
- The Web, stupid!
- Demos
- Data
- Research
- Shared resources & evaluation
- Scale: How large is very large?
- Increased breadth: Geography, Topics
- Commercial: Wall Street & Main Street
27 Broader (and More Applied) View of Computational Linguistics
- Data-mining, Databases, Data Warehousing
- Digital Libraries
- Information Retrieval, Categorization, Extraction
- Lexicography
- Machine Learning
- Machine Translation
- Speech
- Text Analysis
28 Data-Mining Issues (How Large is Very Large?)
- Similar technology to corpus-based methods
- But much larger datasets
- Newswire (AP): 1 million words per week
- Telephone calls: 1-10 billion per month
- IP packets: expected to be even larger
- Tasks: Fraud, Marketing, Operations, Care
- Identify knobs that business partners can turn
- Increase demand (buy TV ads, reduce price)
- Increase supply (buy network capacity, enhance operations)
- Target opportunities for improvement (marketing prospects)
- Track market response in real time (supply/demand by knob)
29 Best of SIGDAT
- Best Invited Talk
- Work of Note
- Work of Note (in Related Fields)
30 Best Invited Talk at a SIGDAT Meeting
- Henry Kucera and Nelson Francis
- Third Workshop on Very Large Corpora (1995)
- Massachusetts Institute of Technology (MIT)
- Cambridge, MA, USA
- Described their work on the Brown Corpus
- At a time when empiricism was out of fashion
- especially at MIT
- Personal & Touching (received standing ovation)
31 Work of Note
- Statistical Machine Translation / Alignment
- Brown et al.
- Statistical Parsing (In 1993, poor use of lexical info)
- Jelinek, Magerman, Charniak, Collins
- Statistical PP Attachment
- Hindle and Rooth
- Word-sense Disambiguation
- Yarowsky
- Text-tiling (Discourse Parsing)
- Hearst
32 Work of Note (in Related Fields)
- Learning
- Classification and Regression Trees (CART)
- Ripper
- Web Tools
- Managing Gigabytes, Harvest, SGML → XML
- Representation
- Suffix Arrays
- Latent Semantic Indexing
33 Summary: Reaching a Wider Audience
- Commercial Successes
- Main Street & Wall Street
- Internationalization
- Goal: equal representation from America, Asia & Europe
- More topic areas
- Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining
34 Self-organizing vs. EDA
- Self-organizing: Learning, HMM
- Statistics do it all
- Manual
- Wilks, Stone Soup: Statistics don't do nothing
- Exploratory Data Analysis (EDA)
- Hybrid of above
35 Time for a little controversy: Two types of Empiricism
- New Linguistic Insights vs. Methodology
- Reviewers do what reviewers do
- Safe, conservative, seek precedents, case law
- Reviewers go easy on methodology papers
- Grim historical reminder
- Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.
- Shouldn't let the methodology get in the way of what we are here to do.